(57) Abstract: An enhanced low-bit rate parametric voice coder that groups a number of frames from an underlying frame-based
vocoder, such as MELP, into a superframe structure. Parameters are extracted from the group of underlying frames and quantized into
the superframe, which allows the bit rate of the underlying coding to be reduced without increasing the distortion. The speech data
coded in the superframe structure can then be directly synthesized to speech or may be transcoded to a format so that an underlying
frame-based vocoder performs the synthesis. The superframe structure includes additional error detection and correction data to
reduce the distortion caused by bit errors introduced during communication.
LPC-HARMONIC VOCODER WITH SUPERFRAME STRUCTURE
1998, IEEE, 1998, Vol. 1, pp. 341-344.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates generally to digital communications and, in particular, to 
parametric speech coding and decoding methods and apparatus. 

2. Description of the Background Art 

For the purpose of definition, it should be noted that the term "vocoder" is 
frequently used to describe voice coding methods wherein voice parameters are 
transmitted instead of digitized waveform samples. In the production of digitized 
waveform samples, an incoming waveform is periodically sampled and digitized into a 
stream of digitized waveform data which can be converted back to an analog waveform 
virtually identical to the original waveform. The encoding of a voice using voice 
parameters provides sufficient accuracy to allow subsequent synthesis of a voice which 
is substantially similar to the one encoded. Note that the use of voice parameter 
encoding does not provide sufficient information to exactly reproduce the voice 
waveform, as is the case with digitized waveforms; however the voice can be encoded 
at a lower data rate than is required with waveform samples. 

In the speech coding community, the term "coder" is often used to refer to a 
speech encoding and decoding system, although it also often refers to an encoder by 
itself. As used herein, the term encoder generally refers to the encoding operation of 
mapping a speech signal to a compressed data signal (the bitstream), and the term 
decoder generally refers to the decoding operation where the data signal is mapped into 
a reconstructed or synthesized speech signal. 

Digital compression of speech (also called voice compression) is increasingly
important for modern communication systems. Low bit rates in the range of 500 bps
(bits per second) to 2 kbps (kilobits per second) are desirable for efficient and
secure voice communication over high frequency (HF) and other radio channels, for
satellite voice paging systems, for multi-player Internet games, and for numerous
additional applications. Most compression methods (also called "coding methods")
for 2.4 kbps, or below, are based on parametric vocoders. The majority of
contemporary vocoders of interest are based on variations of the classical linear
predictive coding (LPC) vocoder and enhancements of that technique, or are based on
sinusoidal coding methods such as harmonic coders and multiband excitation coders
[1]. Recently an enhanced version of the LPC vocoder has been developed which is
called MELP (Mixed Excitation Linear Prediction) [2, 5, 6]. The present invention can
provide similar voice quality levels at a lower bit rate than is required in the
conventional encoding methods described above.

This invention is generally described in relation to its use with MELP, since 
MELP coding has advantages over other frame-based coding methods. However the 
invention is applicable to a variety of coders, such as harmonic coders [15], or 
multiband excitation (MBE) type coders [14]. 

The MELP encoder observes the input speech and, for each 22.5 ms frame, it
generates data for transmission to a decoder. This data consists of bits representing line
spectral frequencies (LSFs) (which are a form of linear prediction parameter), Fourier
magnitudes (sometimes called "spectral magnitudes"), gains (2 per frame), pitch and
voicing, and additionally contains an aperiodic flag bit, error protection bits, and a
synchronization (sync) bit. FIG. 1 shows the buffer structure used in a conventional 2.4
kbps MELP encoder. The encoder employed with other harmonic or MBE coding
methods generates data representing many of the same or similar parameters (typically
these are LSFs, spectral magnitudes, gain, pitch, and voicing). The MELP decoder
receives these parameters for each frame and synthesizes a corresponding frame of
speech that approximates the original frame.

Different communication systems require speech coders with different bit rates.
For example, a high frequency (HF) radio channel may have severely limited capacity
and require extensive error correction, so that a bit rate of 1.2 kbps may be most suitable for
representing the speech parameters, whereas a secure voice telephone communication
system often requires a bit rate of 2.4 kbps. In some applications it is necessary to
interconnect different communication systems so that a voice signal originally encoded
for one system at one bit rate is subsequently converted into an encoded voice signal at
the other bit rate for another system. This conversion is referred to as "transcoding",
and it can be performed by a "transcoder" typically located at a gateway between two
communication systems.

BRIEF SUMMARY OF THE INVENTION 

In general terms, the present invention takes an existing vocoder technique, such
as MELP, and substantially reduces the bit rate, typically by a factor of two, while
maintaining approximately the same reproduced voice quality. The existing vocoder
techniques are made use of within the invention, and they are therefore referred to as
"baseline" coding or alternately "conventional" parametric voice encoding.

By way of example, and not of limitation, the present invention comprises a 1.2
kbps vocoder that has analysis modules similar to a 2.4 kbps MELP coder to which an
additional superframe vocoder is overlaid. A block or "superframe" structure
comprising three consecutive frames is adopted within the superframe vocoder to more
efficiently quantize the parameters that are to be transmitted for the 1.2 kbps vocoder of
the present invention. To simplify the description, the superframe is chosen to encode
three frames, as this ratio has been found to perform well. It should be noted, however,
that the inventive methods can be applied to superframes comprising any discrete
number of frames. A superframe structure has been mentioned in previous patents and
publications [9], [10], [11], [13]. Within the MELP coding standard, each time a frame
is analyzed (e.g., every 22.5 ms), its parameters are encoded and transmitted. However,
in the present invention each frame of a superframe is concurrently available in a buffer,
each frame is analyzed, and the parameters of all three frames within the superframe are
simultaneously available for quantization. Although this introduces additional encoding
delay, the temporal correlation that exists among the parameters of the three frames can
be efficiently exploited by quantizing them together rather than separately.

The frame size of the 1.2 kbps coder of the present invention is preferably 22.5
ms (or 180 samples of speech) at a sampling rate of 8000 samples per second, which is
the same as in the MELP standard coder. However, in order to avoid large pitch errors,
the length of the look-ahead is increased in the invention by 129 samples. In this
regard, note that the term "look-ahead" refers to the time duration of the "future" speech
segment beyond the current frame boundary that must be available in the buffer for
processing needed to encode the current frame. A pitch smoother is also used in the 1.2
kbps coder of the present invention, and the algorithmic delay for the 1.2 kbps coder is
103.75 ms. The transmitted parameters for the 1.2 kbps coder are the same as for the
2.4 kbps MELP coder.
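
The frame and superframe timing stated above can be checked with a few lines of arithmetic. The following C sketch is illustrative only; the function and constant names are hypothetical and are not taken from the MELP standard. It uses only the figures given in the text (180-sample frames, 8000 samples per second, three frames per superframe).

```c
/* Figures taken from the text; the identifiers are hypothetical. */
enum {
    SAMPLE_RATE_HZ = 8000,        /* samples per second */
    FRAME_SAMPLES = 180,          /* samples per frame */
    FRAMES_PER_SUPERFRAME = 3     /* frames grouped into one superframe */
};

/* Duration of n_samples samples, in milliseconds. */
static double samples_to_ms(int n_samples)
{
    return 1000.0 * n_samples / SAMPLE_RATE_HZ;
}

/* Number of bits available for a block of n_samples samples
 * when the coder runs at bit_rate_bps bits per second. */
static double bits_per_block(double bit_rate_bps, int n_samples)
{
    return bit_rate_bps * n_samples / SAMPLE_RATE_HZ;
}
```

With these figures, one frame lasts 22.5 ms and one superframe lasts 67.5 ms, so 2.4 kbps corresponds to 54 bits per frame and 1.2 kbps corresponds to 81 bits per three-frame superframe.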

Within the MELP coding standard, the low band voicing decision or
Unvoiced/Voiced decision (U/V decision) is found for each frame. The frame is said to
be "voiced" when the low band voicing value is "1", and "unvoiced" when it is "0".
This voicing condition determines which of two different bit allocations is used for the
frame. However, in the 1.2 kbps coder of the present invention, each superframe is
categorized into one of several coding states with a different bit allocation for each
state. State selection is done according to the U/V (unvoiced or voiced) pattern of the
superframe. If a channel bit error leads to an incorrect state identification by the
decoder, serious degradation of the synthesized speech for that superframe will result.
Therefore an aspect of the present invention comprises techniques to reduce the effect
of state mismatch between encoder and decoder due to channel errors, which techniques
have been developed and integrated into the decoder.

In the present invention, three frames of speech are simultaneously available in a
memory buffer and each frame is separately analyzed by conventional MELP analysis
modules, generating (unquantized) parameter values for each of the three frames. These
parameters are collectively available for subsequent processing and quantization. The
pitch smoother observes pitch and U/V decisions for the three frames and also performs
additional analysis on the buffered speech data to extract parameters needed to classify
each frame as one of two types (onset or offset) for use in a pitch smoothing operation.
The smoother then outputs modified (smoothed) versions of the pitch decisions, and
these pitch values for the superframe are then quantized. The bandpass voicing
smoother observes the bandpass voicing strengths for the three frames, also
examines energy values extracted directly from the buffered speech, and then
determines a cutoff frequency for each of the three frames. The bandpass voicing
strengths are parameters generated by the MELP encoder to describe the degree of
voicing in each of five frequency bands of the speech spectrum. The cutoff frequencies,
defined later, describe the time evolution of the bandwidth of the voiced part of the
speech spectrum. The cutoff frequency for each voiced frame in the superframe is
encoded with 2 bits. The LSF parameters, Jitter parameter, and Fourier magnitude
parameters for the superframe are each quantized. Binary data is obtained from the
quantizers for transmission. Not described for the sake of simplicity are the error
correction bits, synchronization bit, parity bit, and the multiplexing of the bits into a
serial data stream for transmission, all of which are well-known to those skilled in the
art. At the receiver, the data bits for the various parameters are extracted, decoded and
applied to inverse quantizers that recreate the quantized parameter values from the
compressed data. A receiver typically includes a synchronization module which
identifies the starting point of a superframe, and a means for error correction decoding
and demultiplexing. The recovered parameters for each frame can be applied to a
synthesizer. After decoding, the synthesized speech frames are concatenated to form
the speech output signal. The synthesizer may be a conventional frame-based
synthesizer, such as MELP, or it may be provided by an alternative method as disclosed
herein.

An object of the invention is to introduce greater coding efficiencies and exploit
the correlation from one frame of speech to another by grouping frames into
superframes and performing novel quantization techniques on the superframe
parameters.

Another object of the invention is to allow the existing speech processing
functions of the baseline encoder and decoder to be retained so that the enhanced coder
operates on the parameters found in the baseline coder operation, thereby preserving the
wealth of experimentation and design results already obtained with baseline encoders
and decoders while still offering greatly reduced bit rates.

Another object of the invention is to provide a mechanism for transcoding,
wherein a bit stream obtained from the enhanced encoder is converted (transcoded) into
a bit stream that will be recognized by the baseline decoder, while similarly providing a
way to convert the bit stream coming from a baseline encoder into a bit stream that can
be recognized by an enhanced decoder. This transcoding feature is important in
applications where terminal equipment implementing a baseline coder/decoder must
communicate with terminal equipment implementing the enhanced coder/decoder.

Another object of the invention is to provide methods for improving the
performance of the MELP encoder, wherein new methods generate pitch and voicing
parameters.

Another object of the invention is to provide a new decoding procedure that
replaces the MELP decoding procedure and substantially reduces complexity while
maintaining the synthesized voice quality.

Another object of the invention is to provide a 1.2 kbps coding scheme that
gives approximately equal quality to the MELP standard coder operating at 2.4 kbps.

Further objects and advantages of the invention will be brought out in the
following portions of the specification, wherein the detailed description is for the
purpose of fully disclosing preferred embodiments of the invention without placing
limitations thereon.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be more fully understood by reference to the following
drawings which are for illustrative purposes only:

FIG. 1 is a diagram of data positions used within the input speech buffer
structure of a conventional 2.4 kbps MELP coder. The units shown indicate samples of
speech.

FIG. 2 is a diagram of data positions used within the input superframe speech
buffer structure of the 1.2 kbps coder of the present invention. The units shown indicate
samples of speech.

FIG. 3A is a functional block diagram of the 1.2 kbps encoder of the present
invention.

FIG. 3B is a functional block diagram of the 1.2 kbps decoder of the present
invention.

FIG. 4 is a diagram of data positions within the 1.2 kbps encoder of the present
invention showing computation positions for computing pitch smoother parameters
within the present invention, where the units shown indicate samples of speech.

FIG. 5A is a functional block diagram of a 1200 bps stream up-converted by a
transcoder into a 2400 bps stream.

FIG. 5B is a functional block diagram of a 2400 bps stream down-converted by
a transcoder into a 1200 bps stream.

FIG. 6 is a functional block diagram of hardware within a digital vocoder
terminal which employs the inventive principles in accord with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

For illustrative purposes the present invention will be described with reference
to FIG. 2 through FIG. 6. It will be appreciated that the apparatus may vary as to
configuration and as to details of the parts, and that the method may vary as to the
specific steps and sequence, without departing from the basic concepts as disclosed
herein.

1. OVERVIEW OF THE VOCODER

The 1.2 kbps encoder of the present invention employs analysis modules similar
to those used in a conventional 2.4 kbps MELP coder, but adds a block or "superframe"
encoder which encodes three consecutive frames and quantizes the transmitted
parameters more efficiently to provide the 1.2 kbps vocoding. Those skilled in the art
will appreciate that although the invention is described with reference to using three
frames per superframe, the method of the invention can be applied to superframes
comprising other integral numbers of frames as well. Furthermore, those skilled in the
art will also appreciate that although the invention is described with respect to the use of
MELP as the baseline coder, the methods of the invention can be applied to other
harmonic vocoders. Such vocoders may have a similar, but not identical, set of
parameters extracted from analysis of a speech frame and the frame size and bit rates
may be different from those used in the description presented here.

It will be appreciated that when a frame is analyzed within a MELP encoder
(e.g., every 22.5 ms), voice parameters are encoded for each frame and then transmitted.
Yet, in the present invention, data from a group of frames, forming a superframe, is
collected and processed so that the parameters of all three frames in the superframe
are simultaneously available for quantization. Although this introduces additional
encoding delay, the temporal correlation that exists among the parameters of the three
frames can be efficiently exploited by quantizing them together rather than separately.

The frame size employed in the present invention is preferably 22.5 ms (or 180
samples of speech) at a sampling rate of 8000 samples per second, which is the same
sample rate used in the original MELP coder. The buffer structure of a conventional 2.4
kbps MELP is shown in FIG. 1. The length of the look-ahead buffer has been increased in
the preferred embodiment by 129 samples, so as to reduce the occurrence of large pitch
errors, although the invention can be practiced with various levels of look-ahead.
Additionally, a pitch smoother has been introduced to further reduce pitch errors. The
algorithmic delay for the 1.2 kbps coder described is 103.75 ms. The transmitted
parameters for the 1.2 kbps coder are the same as for the 2.4 kbps MELP coder. The
buffer structure of the present invention can be seen in FIG. 2.

1.1 Bit Allocation

When using MELP coding, the low band voicing decision, or U/V decision, is
found for each frame; the frame is "voiced" when the low band voicing value is 1 and
"unvoiced" when it is 0. However, in the 1.2 kbps coder of the present invention each superframe is
categorized into one of several coding states employing different quantization schemes.
State selection is performed according to the U/V pattern of the superframe. If a
channel bit error leads to an incorrect state identification by the decoder, serious
degradation of the synthesized speech for that superframe will result. Therefore,
techniques to reduce the effect of state mismatch between encoder and decoder due to
channel errors have been developed and integrated into the decoder. For comparison
purposes, the bit allocation schemes for both the 2.4 kbps (MELP) coder and the 1.2
kbps coder are shown in Table 1.
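
Because state selection depends on the U/V pattern of the three frames, an encoder needs some way to turn three voicing decisions into a single pattern value. The C sketch below shows one possible way to do this. The function name and the bit ordering are assumptions made for illustration; the mapping from patterns to the actual coding states and their bit allocations is defined by Table 1 and is not reproduced here.

```c
/* Pack the low band voicing decisions of a three-frame superframe into a
 * pattern index in the range 0..7.
 * Assumed ordering (hypothetical): voiced[0] is the oldest frame and ends up
 * in the most significant bit; voiced[2] is the most recent frame. */
static unsigned uv_pattern(const int voiced[3])
{
    unsigned pattern = 0;
    for (int i = 0; i < 3; ++i) {
        /* Any nonzero decision counts as "voiced" (1). */
        pattern = (pattern << 1) | (voiced[i] ? 1u : 0u);
    }
    return pattern;
}
```

Under this assumed ordering, the pattern voiced/unvoiced/voiced gives index 5, and a fully unvoiced superframe gives index 0. Several patterns may share one coding state, which is why the sketch stops at the pattern index.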

FIG. 3A is a general block diagram of the 1.2 kbps coding scheme 10 in accord
with the present invention. Input speech 12 fills a memory buffer called a superframe
buffer 14 which comprises a superframe and in addition stores the history samples that
preceded the start of the oldest of the three frames and the look-ahead samples that
follow the most recent of the three frames. The actual range of samples stored in this
buffer for the preferred embodiment is as shown in FIG. 2. Frames within the
superframe buffer 14 are separately analyzed by conventional MELP analysis modules
16, 18, 20 which generate a set of unquantized parameter values 22 for each of the
frames within the superframe buffer 14. Specifically, a MELP analysis module 16
operates on the first (oldest) frame stored in the superframe buffer, another MELP
analysis module 18 operates on the second frame stored in the buffer, and another
MELP analysis module 20 operates on the third (most recent) frame stored in the buffer.
Each MELP analysis block has access to a frame plus prior and future samples
associated with that frame. The parameters generated by the MELP analysis modules
are collected to form the set of unquantized parameters stored in memory unit 22, which
is available for subsequent processing and quantization. The pitch smoother 24
observes pitch values for the frames within the superframe buffer 14, in conjunction
with a set of parameters computed by the smoothing analysis block 26, and outputs
modified versions of the pitch values, which are then quantized 28. A bandpass
voicing smoother 30 observes an average energy value computed by the energy analysis
module 32 and it also observes the bandpass voicing strengths for the frames within the
superframe buffer 14 and suitably modifies them for subsequent quantization by the
bandpass voicing quantizer 32. An LSP quantizer 34, Jitter quantizer 36, and Fourier
magnitudes quantizer 38 each output encoded data. Encoded binary data is obtained
from the quantizers for transmission. Not shown for simplicity are the generation of
error correction data bits, a synchronization bit, and multiplexing of the bits into a serial
data stream for transmission, which those skilled in the art will readily understand how
to implement.

At the decoder 50, shown in FIG. 3B, the data bits for the various parameters are
contained in the channel data 52 which enters a decoding and inverse quantizer 54,
which extracts, decodes and applies inverse quantizers to recreate the quantized
parameter values from the compressed data. Not shown are the synchronization module
(which identifies the starting point of a superframe) and the error correction decoding
and demultiplexing, which those skilled in the art will readily understand how to
implement. The recovered parameters for each frame are then applied to conventional
MELP synthesizers 56, 58, 60. It should be noted that this invention includes an
alternative method of synthesizing speech for each frame that is entirely different from
the prior art MELP synthesizer. After being decoded, the synthesized speech frames 62,
64, 66 are concatenated to form the speech output signal 68.

2. SPEECH ANALYSIS

2.1 Overview

The basic structure of the encoder is based on the same analysis module used in
the 2.4 kbps MELP coder except that a new pitch smoother and bandpass voicing
smoother are added to take advantage of the superframe structure. The coder extracts
the feature parameters from three successive frames in a superframe using the same
MELP analysis algorithm, operating on each frame, as used in the 2.4 kbps MELP
coder. The pitch and bandpass voicing parameters are enhanced by smoothing. This
enhancement is possible because of the simultaneous availability of three adjacent
frames and the look-ahead. By operating in this manner on the superframe, the
parameters for all three frames are available as input data to the quantization modules,
thereby allowing more efficient quantization than is possible when each frame is
separately and independently quantized.

2.2 Pitch Smoother 

The pitch smoother takes the pitch estimates from the MELP analysis module
for each frame in the superframe and a set of parameters from the smoothing analysis
module 26 shown in FIG. 3A. The smoothing analysis module 26 computes a set of
new parameters every half frame (11.25 ms) from direct observation of the speech
samples stored in the superframe buffer. The nine computation positions in the current
superframe are illustrated in FIG. 4. Each computation position is at the center of a
window in which the parameters are computed. The computed parameters are then
applied as additional information to the pitch smoother.

In the 1.2 kbps encoder, each frame is classified into one of two categories,
onset or offset, in order to guide the pitch smoothing process. The new
waveform feature parameters computed by the smoothing analysis module 26, and then
used by the pitch smoother module 24 for the onset/offset classification, are as follows:

Description                                                          Abbreviation
energy in dB                                                         subEnergy
zero crossing rate                                                   zeroCrosRate
peakiness measurement                                                peakiness
maximum correlation coefficient of input speech                      corx
maximum correlation coefficient of 500 Hz low pass filtered speech   lowBandCorx
energy of low pass filtered speech                                   lowBandEn
energy of high pass filtered speech                                  highBandEn

Input speech is denoted as $x(n)$, $n = \ldots, 0, 1, \ldots$, where $x(0)$ corresponds to the speech
sample that is 45 samples to the left of the current computation position, and $N$ is 90
samples, which is half of the frame size. The parameters are computed as follows:

(1) Energy:

$$\mathrm{subEnergy} = 10 \log_{10} \left( \sum_{n=0}^{N-1} x^2(n) \right)$$

(2) Zero crossing rate:

$$\mathrm{zeroCrosRate} = \frac{1}{N-1} \sum_{i=0}^{N-2} \left[ x(i)\,x(i+1) > 0 \;?\; 0 : 1 \right]$$

where the expression in square brackets has value 1 when the product $x(i)\,x(i+1)$ is
negative (i.e., when a zero crossing occurs) and otherwise it has value zero.

(3) Peakiness measurement in speech domain:

$$\mathrm{peakiness} = \frac{\sqrt{\dfrac{1}{N} \sum_{n=0}^{N-1} x^2(n)}}{\dfrac{1}{N} \sum_{n=0}^{N-1} |x(n)|}$$

The peakiness measure is defined as in the MELP coder [5]; however, here this measure
is computed from the speech signal itself, whereas in MELP it is computed from the
prediction residual signal that is derived from the speech signal.



(4) Maximum correlation coefficient in pitch search range:

First the input speech signal is passed through a low-pass filter with an 800 Hz
cutoff frequency, where:

$$H(z) = \frac{0.3069}{1 - 2.4552 z^{-1} + 2.4552 z^{-2} - 1.152 z^{-3} + 0.2099 z^{-4}}$$

The low-pass filtered signal is passed through a 2nd order LPC inverse filter. The
inverse filtered signal is denoted as $s_{lv}(n)$. The DC component is removed from
$s_{lv}(n)$ to obtain $\tilde{s}_{lv}(n)$. Then, the autocorrelation function is computed by:

$$r_k = \frac{\sum_{n=0}^{M-1} \tilde{s}_{lv}(n)\,\tilde{s}_{lv}(n+k)}{\sqrt{\sum_{n=0}^{M-1} \tilde{s}_{lv}^2(n) \, \sum_{n=0}^{M-1} \tilde{s}_{lv}^2(n+k)}}, \quad k = 20, \ldots, 150$$

where $M = 70$. The samples are selected using a sliding window chosen to align the
current computation position to the center of the autocorrelation window. The
maximum correlation coefficient parameter corx is the maximum of the function $r_k$.
The corresponding pitch is $l$.

$$\mathrm{corx} = \max_{k} r_k, \quad l = \arg\max_{k} r_k$$

(5) Maximum correlation coefficient of low pass filtered speech:

In the standard MELP, five filters are used in bandpass voicing analysis. The
first filter is actually a low-pass filter with passband of 0-500 Hz. The same filter is
used on input speech to generate the low-pass filtered signal $s_g(n)$. Then the
correlation function defined in (4) is computed on $s_g(n)$. The range of the indices is
limited by $[\max(20, l - 5), \min(150, l + 5)]$. The maximum of the correlation function is
denoted as lowBandCorx.

(6) Low band energy and high band energy: 

In the LPC analysis module, the first 17 autocorrelation coefficients
$r(n)$, $n = 0, \ldots, 16$ are computed. The low band energy and high band energy are
obtained by filtering the autocorrelation coefficients.

$$\mathrm{lowBandEn} = r(0)\,C_l(0) + 2 \sum_{n=1}^{16} r(n)\,C_l(n)$$

$$\mathrm{highBandEn} = r(0)\,C_h(0) + 2 \sum_{n=1}^{16} r(n)\,C_h(n)$$

The $C_l(n)$ and $C_h(n)$ are the coefficients for the low pass filter and the high pass filter. The
16 filter coefficients for each filter are chosen for a cutoff frequency of 2 kHz and are
obtained with a standard FIR filter design technique.
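
The band energy formula is a weighted sum of autocorrelation coefficients and can be written as a single loop. The C sketch below is illustrative only; the function name is hypothetical, and the filter coefficient tables are not given in the text, so they are passed in by the caller rather than defined here.

```c
/* Band energy from autocorrelation coefficients r[0..order] and filter
 * coefficients c[0..order]:  r[0]*c[0] + 2 * sum_{n=1}^{order} r[n]*c[n].
 * The same routine serves lowBandEn and highBandEn; only the coefficient
 * table differs. In the text, order is 16. */
static double band_energy(const double *r, const double *c, int order)
{
    double acc = r[0] * c[0];
    for (int n = 1; n <= order; ++n)
        acc += 2.0 * r[n] * c[n];
    return acc;
}
```

Working from the autocorrelation coefficients that the LPC analysis has already computed avoids filtering the speech samples a second time for each band.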

The parameters enumerated above are used to make rough U/V decisions for
each half frame. The classification logic for making the voicing decisions shown below
is performed in the pitch smoother module 24. The voicedEn and silenceEn are the
running average energies of voiced frames and silence frames.

struct {
    float subEnergy;     /* energy in dB */
    float zeroCrosRate;  /* zero crossing rate */
    float peakiness;     /* peakiness measurement */
    float corx;          /* maximum correlation coefficient of input speech */
    float lowBandCorx;   /* maximum correlation coefficient of
                            500Hz low pass filtered speech */
    float lowBandEn;     /* Energy of low pass filtered speech */
    float highBandEn;    /* Energy of high pass filtered speech */
} classStat[9];

/* Below, classStat refers to the entry for the current computation
   position and classStat[-1] to the entry for the previous position. */
if( classStat->subEnergy < 30 ){
    classy = SILENCE;
}else if( classStat->subEnergy < 0.35*voicedEn + 0.65*silenceEn ){
    if( (classStat->zeroCrosRate > 0.6) &&
        ((classStat->corx < 0.4) || (classStat->lowBandCorx < 0.5)) )
        classy = UNVOICED;
    else if( (classStat->lowBandCorx > 0.7) ||
             ((classStat->lowBandCorx > 0.4) && (classStat->corx > 0.7)) )
        classy = VOICED;
    else if( (classStat->zeroCrosRate - classStat[-1].zeroCrosRate > 0.3) ||
             (classStat->subEnergy - classStat[-1].subEnergy > 20) ||
             (classStat->peakiness > 1.6) )
        classy = TRANSITION;
    else if( (classStat->zeroCrosRate > 0.55) ||
             ((classStat->highBandEn > classStat->lowBandEn-5) &&
              (classStat->zeroCrosRate > 0.4)) )
        classy = UNVOICED;
    else classy = SILENCE;
}else{
    if( (classStat->zeroCrosRate - classStat[-1].zeroCrosRate > 0.2) ||
        (classStat->subEnergy - classStat[-1].subEnergy > 20) ||
        (classStat->peakiness > 1.6) ){
        if( (classStat->lowBandCorx > 0.7) || (classStat->corx > 0.8) )
            classy = VOICED;
        else
            classy = TRANSITION;
    }else if( classStat->zeroCrosRate < 0.2 ){
        if( (classStat->lowBandCorx > 0.5) ||
            ((classStat->lowBandCorx > 0.3) && (classStat->corx > 0.6)) )
            classy = VOICED;
        else if( classStat->subEnergy > 0.7*voicedEn+0.3*silenceEn ){
            if( classStat->peakiness > 1.5 )
                classy = TRANSITION;
            else{
                classy = VOICED;
            }
        }else{
            classy = SILENCE;
        }
    }else if( classStat->zeroCrosRate < 0.5 ){
        if( (classStat->lowBandCorx > 0.55) ||
            ((classStat->lowBandCorx > 0.3) && (classStat->corx > 0.65)) )
            classy = VOICED;
        else if( (classStat->subEnergy < 0.4*voicedEn+0.6*silenceEn) &&
                 (classStat->highBandEn < classStat->lowBandEn-10) )
            classy = SILENCE;
        else if( classStat->peakiness > 1.4 )
            classy = TRANSITION;
        else
            classy = UNVOICED;
    }else if( classStat->zeroCrosRate < 0.7 ){
        if( ((classStat->lowBandCorx > 0.6) && (classStat->corx > 0.3)) ||
            ((classStat->lowBandCorx > 0.4) && (classStat->corx > 0.7)) )
            classy = VOICED;
        else if( classStat->peakiness > 1.5 )
            classy = TRANSITION;
        else
            classy = UNVOICED;
    }else{
        if( ((classStat->lowBandCorx > 0.65) && (classStat->corx > 0.3)) ||
            ((classStat->lowBandCorx > 0.45) && (classStat->corx > 0.7)) )
            classy = VOICED;
        else if( classStat->peakiness > 2.0 )
            classy = TRANSITION;
        else
            classy = UNVOICED;
    }
}

The U/V decisions for each subframe are then used to classify the frames as
onset or offset. This classification is internal to the encoder and is not transmitted. For
each current frame, first the possibility of an offset is checked. An offset frame is
selected if the current voiced frame is followed by a sequence of unvoiced frames, or
the energy declines at least 8 dB within one frame or 12 dB within one and one-half
frames. The pitch of an offset frame is not smoothed.
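
The energy part of this test can be sketched in C as follows. The sketch rests on assumptions that the text does not settle: energies are taken to be the subEnergy values in dB at successive half-frame computation positions, so that one frame spans two positions and one and one-half frames span three, and the change is measured forward from the given position. The function names are hypothetical. The same thresholds with the direction reversed give the rise test that the text uses for onset frames.

```c
/* e[] holds energy in dB at successive half-frame computation positions;
 * n_pos is the number of valid entries and pos is the position examined. */

/* Nonzero if the energy falls by at least 8 dB over one frame (two
 * positions) or by at least 12 dB over one and one-half frames (three). */
static int energy_drop(const double *e, int pos, int n_pos)
{
    if (pos + 2 < n_pos && e[pos] - e[pos + 2] >= 8.0)
        return 1;
    if (pos + 3 < n_pos && e[pos] - e[pos + 3] >= 12.0)
        return 1;
    return 0;
}

/* Nonzero if the energy rises by the same amounts over the same spans. */
static int energy_rise(const double *e, int pos, int n_pos)
{
    if (pos + 2 < n_pos && e[pos + 2] - e[pos] >= 8.0)
        return 1;
    if (pos + 3 < n_pos && e[pos + 3] - e[pos] >= 12.0)
        return 1;
    return 0;
}
```

A complete offset decision would combine energy_drop with the U/V sequence test described above; only the energy criterion is shown here.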

If the current frame is the first voiced frame, or the energy increases by at least 8
dB within one frame or 12 dB within one and one-half frames, the current frame is
classified as an onset frame. For the onset frames, a look-ahead pitch candidate is
estimated from one of the local maximums of the autocorrelation function evaluated in
the look-ahead region. First, the 8 largest local maximums of the autocorrelation
function given above are selected. The maximums are denoted for the current
computation position as $R^{(0)}(i)$, $i = 0, \ldots, 7$. The maximums for the next two
computation positions are $R^{(1)}(i)$, $R^{(2)}(i)$. A cost function for each computation
position is computed, and the cost function for the current computation position is used
to estimate the predicted pitch. The cost function for $R^{(2)}(i)$ is computed first as:

$$C^{(2)}(i) = W\,[1 - R^{(2)}(i)]$$

where $W$ is a constant which is 100. For each maximum $R^{(1)}(i)$, the corresponding
pitch is denoted as $p^{(1)}(i)$. The cost function $C^{(1)}(i)$ is computed as:

$$C^{(1)}(i) = W\,[1 - R^{(1)}(i)] + |p^{(1)}(i) - p^{(2)}(k_i)| + C^{(2)}(k_i)$$

The index $k_i$ is chosen as:

$$k_i = \arg\max_{l} R^{(2)}(l), \quad l \in \{\, l : |p^{(2)}(l) - p^{(1)}(i)| \,/\, p^{(1)}(i) < 0.2 \,\}$$

If the range for $l$ is an empty set in the above equation, then the range $l \in [0, 7]$ is
used. The cost function $C^{(0)}(i)$ is computed in a similar way as $C^{(1)}(i)$. The
predicted pitch is chosen as

$$p = \arg\max_{i} C^{(0)}(i), \quad i = 0, \ldots, 7$$

The look-ahead pitch candidate is selected as the current pitch if the difference between the
original pitch estimate and the look-ahead pitch is larger than 15%.
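
By way of example and not of limitation, the two-stage cost computation described above may be sketched in C as follows. The function names `pick_k` and `lookahead_costs`, the array layout, and the constant names are assumptions made for illustration only and are not taken from the reference coder. The sketch covers only $C^{(2)}$, the index $k_i$, and $C^{(1)}$.

```c
#include <math.h>

#define NUM_MAX 8        /* number of local maxima kept per position */
#define COST_W  100.0    /* the constant W from the text */

/* Index k_i: among the maxima at position 2 whose pitch lies within 20%
 * of p1, pick the one with the largest correlation. If none qualifies,
 * all NUM_MAX candidates are searched instead. p1 must be positive. */
static int pick_k(const double r2[NUM_MAX], const double p2[NUM_MAX], double p1)
{
    int best = -1;
    for (int l = 0; l < NUM_MAX; l++) {
        if (fabs(p2[l] - p1) / p1 < 0.2 && (best < 0 || r2[l] > r2[best]))
            best = l;
    }
    if (best < 0) {                 /* empty range: fall back to l in [0,7] */
        best = 0;
        for (int l = 1; l < NUM_MAX; l++)
            if (r2[l] > r2[best])
                best = l;
    }
    return best;
}

/* C2(i) = W[1 - R2(i)]
 * C1(i) = W[1 - R1(i)] + |p1(i) - p2(k_i)| + C2(k_i) */
void lookahead_costs(const double r1[NUM_MAX], const double p1[NUM_MAX],
                     const double r2[NUM_MAX], const double p2[NUM_MAX],
                     double c1[NUM_MAX], double c2[NUM_MAX])
{
    for (int i = 0; i < NUM_MAX; i++)
        c2[i] = COST_W * (1.0 - r2[i]);
    for (int i = 0; i < NUM_MAX; i++) {
        int k = pick_k(r2, p2, p1[i]);
        c1[i] = COST_W * (1.0 - r1[i]) + fabs(p1[i] - p2[k]) + c2[k];
    }
}
```

The same routine could be applied one position earlier to obtain $C^{(0)}$ from $C^{(1)}$, since the text states that $C^{(0)}$ is computed in a similar way.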

If the current frame is neither offset nor onset, the pitch variation is checked. If
a pitch jump is detected, which means the pitch decreases and then increases or
increases and then decreases, the pitch of the current frame is smoothed using
interpolation between the pitch of the previous frame and the pitch of the next frame.
For the last frame in the superframe the pitch of the next frame is not available,
therefore a predicted pitch value is used instead of the next frame pitch value. The
above pitch smoother detects many of the large pitch errors that would otherwise occur
and in formal subjective quality tests, the pitch smoother provided significant quality
improvement.

2.3 Bandpass Voicing Smoother

In MELP encoding, the input speech is filtered into five subbands. Bandpass
voicing strengths are computed for each of these subbands with each voicing strength
normalized to a value between 0 and 1. These strengths are subsequently quantized
to 0s or 1s, to obtain bandpass voicing decisions. The quantized lowband (0 to 500 Hz)
voicing strength determines the unvoiced or voiced (U/V) character of the frame. The
binary voicing information of the remaining four bands partially describes the harmonic
or nonharmonic character of the spectrum of a frame and can be represented by a four
bit codeword. In this invention, a bandpass voicing smoother is used to more compactly
describe this information for each frame in a superframe and to smooth the time
evolution of this information across frames. First, the four bit codeword (1 for voiced,
0 for unvoiced) for the remaining four bands of each frame is mapped into a single
cutoff frequency with one of four allowed values. This cutoff frequency approximately
identifies the boundary between the lower region of the spectrum that has a voiced (or
harmonic) character and the higher region that has an unvoiced character. The
smoother then modifies the three cutoff frequencies in the superframe to produce a
more natural time evolution for the spectral character of the frames. The 4-bit binary
voicing codeword for each of the frame decisions is mapped into one of four codewords
using the 2-bit codebook shown in Table 2. The entries of the codebook are equivalent to
the four cutoff frequencies: 500 Hz, 1000 Hz, 2000 Hz and 4000 Hz, which correspond
respectively to the columns labeled 0000, 1000, 1100, and 1111 in the mapping table
given in Table 2. For example, when the bandpass voicing pattern for a voiced frame is
1001, this index is mapped into 1000, which corresponds to a cutoff frequency of 1000
Hz.

For the first two frames of the current superframe, the cutoff frequency is
smoothed according to the bandpass voicing information of the previous frame and the
next frame. The cutoff frequency in the third frame is left unchanged. The average
energy of voiced frames is denoted as VE. The value of VE is updated at each voiced
frame for which the two prior frames are voiced. The updating rule is:

$$VE_{new} = 10 \log_{10}\left[\,0.9 \cdot 10^{VE_{old}/10} + 0.1 \cdot 10^{en_i/10}\,\right]$$

For the frame $i$, the energy of the current frame is denoted as $en_i$. The voicing
strengths for the five bands are denoted as $bp[k]_i$, $k = 1, \ldots, 5$. The following three
conditions are considered to smooth the cutoff frequency $f_i$.

(1) If the cutoff frequencies of the previous frame and the next frame are
both above 2000 Hz, then execute the following procedure.

    If ( $f_i < 2000$ and ( ($en_i > VE - 5$ dB) or ($bp[2]_{i-1} > 0.5$ and $bp[3]_{i-1} > 0.5$) ) )
        $f_i = 2000$ Hz
    else if ( $f_i < 1000$ )
        $f_i = 1000$ Hz

(2) If the cutoff frequencies of the previous frame and the next frame are
both above 1000 Hz, then execute the following procedure.

    If ( $f_i < 1000$ and ( ($en_i > VE - 10$ dB) or ($bp[2]_{i-1} > 0.4$) ) )
        $f_i = 1000$ Hz

(3) If the cutoff frequencies of the previous frame and the next frame are both
below 1000 Hz, then execute the following procedure.

    If ( $f_i > 2000$ and $en_i < VE - 5$ dB and $bp[3] < 0.7$ )
        $f_i = 2000$ Hz

3. QUANTIZATION 
3.1 Overview 

The transmitted parameters of the 1.2 kbps coder are the same as those of the 2.4
kbps MELP coder except that in the 1.2 kbps coder the parameters are not transmitted
frame by frame but are sent once for each superframe. The bit-allocation is shown in
Table 1. New quantization schemes were designed to take advantage of the long block
size (the superframe) by using interpolation and vector quantization (VQ). The
statistical properties of voiced and unvoiced speech are also taken into account. The
same Fourier magnitude codebook of the 2.4 kbps MELP coder is used in the 1.2 kbps
coder in order to save memory and to make the transcoding easier.

3.2 Pitch Quantization

The pitch parameters are applicable only for voiced frames. Different pitch
quantization schemes are used for different U/V combinations across the three frames.
The detailed method for quantizing the pitch values of a superframe is herein described
for a particular voicing pattern. The quantization method described in this section is
used in the joint quantization of the voicing pattern and the pitch, which will be
described in the following section. The pitch quantization schemes are summarized in
Table 3. Within those superframes where the voicing pattern contains either two or three
voiced frames, the pitch parameters are vector-quantized. For voicing patterns containing
only one voiced frame, the scalar quantizer specified in the MELP standard is applied for
the pitch of the voiced frame. For the UUU voicing pattern, where each frame is unvoiced,
no bits are needed for pitch information. Note that U denotes "Unvoiced" and V
denotes "Voiced".

Each pitch value, $P$, obtained from the pitch analysis of the 2.4 kbps standard is
transformed into a logarithmic value, $p = \log P$, before quantization. For each
superframe, a pitch vector is constructed with components equal to the log pitch value
for each voiced frame and a zero value for each unvoiced frame. For voicing patterns
with two or three voiced frames, the pitch vector is quantized using a VQ (Vector
Quantization) algorithm with a new distortion measure that takes into account the
evolution of the pitch. A standard VQ codebook design is used [7]. The VQ encoding
algorithm incorporates pitch differentials in the codebook search, which makes it
possible to consider the time evolution of the pitch in selecting the VQ codebook entry.
This feature is motivated by the perceptual importance of adequately tracking the pitch
trajectory. The algorithm has three steps for obtaining the best index:

Step 1: Select the M-best candidates using the weighted squared Euclidean
distance measure:

$$d = \sum_{i=1}^{3} w_i \,|p_i - \hat{p}_i|^2 \qquad (1)$$

where

$$w_i = \begin{cases} 1 & \text{if the corresponding frame is voiced} \\ 0 & \text{if the corresponding frame is unvoiced} \end{cases}$$

and $p_i$ is the unquantized log pitch and $\hat{p}_i$ is the quantized log pitch value. The
above equation indicates that only voiced frames are taken into consideration in the
codebook search.

Step 2: Calculate differentials of the unquantized log pitch values using:

$$\Delta p_i = \begin{cases} p_i - p_{i-1} & \text{if the } i\text{-th and } (i-1)\text{-th frames are voiced} \\ 0 & \text{else} \end{cases} \qquad (2)$$

for $i = 1, 2, 3$, where $p_0$ is the last log pitch value of the previous superframe. For the
candidate log pitch values selected in step 1, calculate differentials of the candidates by
replacing $\Delta p_i$ and $p_i$ by $\Delta\hat{p}_i$ and $\hat{p}_i$ respectively in equation (2), where
$\hat{p}_0$ is the quantized version of $p_0$.

Step 3: Select the index from the M best candidates that minimizes:

$$d' = \sum_{i=1}^{3} w_i \,|p_i - \hat{p}_i|^2 + \delta \sum_{i=1}^{3} |\Delta p_i - \Delta\hat{p}_i|^2 \qquad (3)$$

where $\delta$ is a parameter to control the contribution of pitch differentials, which is
set to be 1.
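
The three steps above can be illustrated with the following C sketch, offered as one possible arrangement rather than the reference implementation. The function name `pitch_vq_search`, the flat codebook layout, the limit `M_MAX`, and the `prev_voiced` flag (used to decide whether the first differential is nonzero) are assumptions introduced for the example.

```c
#include <float.h>
#include <stddef.h>

#define NFRAMES 3
#define M_MAX   8     /* upper bound on the M-best list (assumed) */

/* Step 1 distance: squared error over voiced frames only. */
static double d_static(const double *p, const double *q, const int *voiced)
{
    double d = 0.0;
    for (int i = 0; i < NFRAMES; i++)
        if (voiced[i])
            d += (p[i] - q[i]) * (p[i] - q[i]);
    return d;
}

/* Step 2: differential is nonzero only when frame i and frame i-1 are voiced. */
static void differentials(const double *v, const int *voiced,
                          double v0, int prev_voiced, double *out)
{
    double prev = v0;
    int pv = prev_voiced;
    for (int i = 0; i < NFRAMES; i++) {
        out[i] = (voiced[i] && pv) ? v[i] - prev : 0.0;
        prev = v[i];
        pv = voiced[i];
    }
}

/* cb holds cb_size entries of NFRAMES log-pitch values each (cb_size >= 1).
 * Returns the index of the selected codebook entry. */
size_t pitch_vq_search(const double *cb, size_t cb_size,
                       const double *p, const int *voiced,
                       double p0, double p0_hat, int prev_voiced,
                       size_t m_best, double delta)
{
    size_t cand[M_MAX], n = 0;
    double cand_d[M_MAX], dp[NFRAMES];

    if (m_best < 1) m_best = 1;
    if (m_best > M_MAX) m_best = M_MAX;

    /* Step 1: keep the m_best entries with the smallest static distance,
     * stored in ascending order of distance. */
    for (size_t c = 0; c < cb_size; c++) {
        double d = d_static(p, cb + c * NFRAMES, voiced);
        size_t pos = n;
        while (pos > 0 && cand_d[pos - 1] > d) pos--;
        if (pos >= m_best) continue;
        size_t last = (n < m_best) ? n : m_best - 1;
        for (size_t k = last; k > pos; k--) {
            cand[k] = cand[k - 1];
            cand_d[k] = cand_d[k - 1];
        }
        cand[pos] = c;
        cand_d[pos] = d;
        if (n < m_best) n++;
    }

    /* Steps 2 and 3: add the weighted differential error and pick the best. */
    differentials(p, voiced, p0, prev_voiced, dp);
    size_t best = 0;
    double best_d = DBL_MAX;
    for (size_t k = 0; k < n; k++) {
        double dq[NFRAMES], d = cand_d[k];
        differentials(cb + cand[k] * NFRAMES, voiced, p0_hat, prev_voiced, dq);
        for (int i = 0; i < NFRAMES; i++)
            d += delta * (dp[i] - dq[i]) * (dp[i] - dq[i]);
        if (d < best_d) { best_d = d; best = cand[k]; }
    }
    return best;
}
```

With M = 1 the search reduces to an ordinary nearest-neighbor VQ. A larger M allows an entry that tracks the pitch trajectory more closely to be chosen over one with a slightly smaller static error.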

For superframes that contain only one voiced frame, scalar quantization of the
pitch is performed. The pitch value is quantized on a logarithmic scale with a 99-level
uniform quantizer ranging from 20 to 160 samples. The quantizer is the same as that in
the 2.4 kbps MELP standard, where the 99 levels are mapped to a 7 bit pitch codeword
and the 28 unused codewords with Hamming weight 1 or 2 are used for error protection.
3.3 Joint Quantization of Pitch and U/V Decisions 
The U/V decisions and pitch parameters for each superframe are jointly 
quantized using 12 bits. The joint quantization scheme is summarized in Table 4. In 
other words, the voicing pattern or mode (one of 8 possible patterns) and the set of three 
pitch values for the superframe form the input to a joint quantization scheme whose 
output is a 12 bit word. The decoder subsequently maps this 12 bit word by means of a 
table lookup into a particular voicing pattern and a quantized set of 3 pitch values. 

In this scheme, the allocation of 12-bits consists of 3 mode bits (representing the 
8 possible combinations of U/V decisions for the 3 frames in a superframe) and the 
remaining 9 bits for pitch values. The scheme employs six separate pitch codebooks, 
five having 9 bits (i.e. 512 entries each) and one being the scalar quantizer as indicated 
in Table 4; the specific codebook is determined according to the bit patterns of the 3-bit 
codeword representing the quantized voicing pattern. Therefore the U/V voicing 
pattern is first encoded into a 3-bit codeword as shown in Table 4, which is then used to 
select one of the 6 codebooks shown. The ordered set of 3 pitch values is then vector 
quantized with the selected codebook to generate a 9- bit codeword that identifies the 
quantized set of 3 pitch values. Note that four codebooks are assigned to the 
superframes in the VVV (voiced-voiced-voiced) mode, which means that the pitch
vectors in the VVV type superframes are each quantized by one of 2048 codewords. If
the number of voiced frames in the superframe is not larger than one, the 3-bit 
codeword is set to 000 and the distinction between different modes is determined within 
the 9-bit codebook. Note that the latter case consists of the 4 modes UUU, VUU, UVU, 
and UUV (where U denotes an unvoiced frame and V a voiced frame and the three 
symbols indicate the voicing status of the ordered set of 3 frames in a superframe). In 
this case, the 9 available bits are more than sufficient to represent the mode information 
as well as the pitch value since there are 3 modes with 128 pitch values and one mode
with no pitch value.

3.4 Parity Bit 

To improve robustness to transmission errors, a parity check bit is computed and 
transmitted for the three mode bits (representing voicing patterns) in the superframe as 
defined above in Section 3.3. 
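
A parity check over the three mode bits can be sketched as below. Even parity is assumed here because the text does not state whether even or odd parity is used, and the function names are hypothetical.

```c
/* Even parity (assumed) over the three mode bits: returns 1 when the
 * number of set bits in the low three bits of mode3 is odd. */
unsigned mode_parity(unsigned mode3)
{
    unsigned m = mode3 & 0x7u;
    return (m ^ (m >> 1) ^ (m >> 2)) & 1u;
}

/* Returns nonzero when the received parity bit agrees with the mode bits. */
int mode_parity_ok(unsigned mode3, unsigned parity_bit)
{
    return mode_parity(mode3) == (parity_bit & 1u);
}
```

A single parity bit detects any odd number of bit errors among the mode bits but cannot locate them, which is why it is complemented by the mode protection bits described in Section 3.11.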

3.5 LSF Quantization 

The bit allocation for quantizing the line spectral frequencies (LSF's) is shown
in Table 5, with the original LSF vectors for the three frames denoted by $l_1, l_2, l_3$. For
the UUU, UUV, UVU and VUU modes, the LSF vectors of unvoiced frames are
quantized using a 9-bit codebook, while the LSF vector of the voiced frame is quantized
with a 24 bit multistage VQ (MSVQ) quantizer based on the approach described in [8].

The LSF vectors for the other U/V patterns are encoded using the following
forward-backward interpolation scheme. The quantized LSF vector of the previous frame
is denoted by $\hat{l}_p$. First, the LSF's of the last frame in the current superframe,
$l_3$, are directly quantized to $\hat{l}_3$ using the 9-bit codebook for
unvoiced frames or the 24 bit MSVQ for voiced frames. Predicted values of $l_1$ and $l_2$
are then obtained by interpolating $\hat{l}_p$ and $\hat{l}_3$ using the following equations:

$$\tilde{l}_1(j) = a_1(j)\,\hat{l}_p(j) + [1 - a_1(j)]\,\hat{l}_3(j)$$

$$\tilde{l}_2(j) = a_2(j)\,\hat{l}_p(j) + [1 - a_2(j)]\,\hat{l}_3(j), \quad j = 1, \ldots, 10 \qquad (4)$$

where $a_1(j)$ and $a_2(j)$ are the interpolation coefficients.

The design of the MSVQ (multistage vector quantization) codebooks follows the
procedure explained in [8].

The coefficients are stored in a codebook and the best coefficients are selected
by minimizing the distortion measure:

$$E = \sum_{j=1}^{10} w_1(j)\,|l_1(j) - \tilde{l}_1(j)|^2 + \sum_{j=1}^{10} w_2(j)\,|l_2(j) - \tilde{l}_2(j)|^2 \qquad (5)$$

where the coefficients $w_i(j)$ are the same as in the 2.4 kbps MELP standard. After
obtaining the best interpolation coefficients, the residual LSF vectors for frames 1 and 2
are computed by:

$$r_1(j) = l_1(j) - \tilde{l}_1(j)$$

$$r_2(j) = l_2(j) - \tilde{l}_2(j), \quad j = 1, \ldots, 10 \qquad (6)$$

The 20-dimension residual vector $R = [r_1(1), r_1(2), \ldots, r_1(10), r_2(1), r_2(2), \ldots, r_2(10)]$ is then
quantized using weighted multi-stage vector quantization.
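
The interpolation and residual computation can be sketched in C as follows. This is an illustrative arrangement only: the function name `lsf_residual` and the argument order are assumptions, and the selection of the interpolation coefficients from the codebook is taken as already done.

```c
#define LSF_ORDER 10

/* lp_hat, l3_hat: quantized LSF vectors of the previous frame and of the
 *                 last frame of the current superframe.
 * l1, l2:         unquantized LSF vectors of frames 1 and 2.
 * a1, a2:         interpolation coefficients chosen from the codebook.
 * residual:       output, r1(1..10) followed by r2(1..10). */
void lsf_residual(const double *lp_hat, const double *l3_hat,
                  const double *l1, const double *l2,
                  const double *a1, const double *a2,
                  double *residual)
{
    for (int j = 0; j < LSF_ORDER; j++) {
        /* predicted values: weighted mix of the two quantized anchors */
        double t1 = a1[j] * lp_hat[j] + (1.0 - a1[j]) * l3_hat[j];
        double t2 = a2[j] * lp_hat[j] + (1.0 - a2[j]) * l3_hat[j];
        residual[j] = l1[j] - t1;
        residual[LSF_ORDER + j] = l2[j] - t2;
    }
}
```

Because the decoder has the same quantized anchors and the same coefficient codebook, it can rebuild the predicted vectors and add the decoded residual to them.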

3.6 Method for Designing the Interpolation Codebook 

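The interpolation coefficients were obtained as follows. The optimal
interpolation coefficients for each superframe were computed by minimizing the
weighted mean square error between $l_1, l_2$ and $\tilde{l}_1, \tilde{l}_2$, which can be shown to result in:

$$a_1(j) = \frac{w_1(j)\,[\hat{l}_3(j) - l_1(j)]\,[\hat{l}_3(j) - \hat{l}_p(j)]}{w_1(j)\,[\hat{l}_3(j) - \hat{l}_p(j)]^2}$$

$$a_2(j) = \frac{w_2(j)\,[\hat{l}_3(j) - l_2(j)]\,[\hat{l}_3(j) - \hat{l}_p(j)]}{w_2(j)\,[\hat{l}_3(j) - \hat{l}_p(j)]^2}, \quad j = 1, \ldots, 10 \qquad (7)$$

Each entry of the training database for the codebook design employs the 40-dimension
vector $(\hat{l}_p, l_1, l_2, \hat{l}_3)$, and the training procedure described below.

The database is denoted as $L = \{(\hat{l}_{p,n}, l_{1,n}, l_{2,n}, \hat{l}_{3,n}),\ n = 0, 1, 2, \ldots, N-1\}$, where
$(\hat{l}_{p,n}, l_{1,n}, l_{2,n}, \hat{l}_{3,n}) = [\hat{l}_{p,n}(1), \ldots, \hat{l}_{p,n}(10), l_{1,n}(1), \ldots, l_{1,n}(10), l_{2,n}(1), \ldots, l_{2,n}(10), \hat{l}_{3,n}(1), \ldots, \hat{l}_{3,n}(10)]$
is a 40-dimension vector. The output codebook is $C = \{(a_{1,m}, a_{2,m}),\ m = 0, \ldots, M-1\}$, where
$(a_{1,m}, a_{2,m}) = [a_{1,m}(1), \ldots, a_{1,m}(10), a_{2,m}(1), \ldots, a_{2,m}(10)]$ is a 20-dimension vector.

3.6.1 The two main procedures of the codebook training are now described.
Given the codebook $C = \{(a_{1,m}, a_{2,m}),\ m = 0, \ldots, M-1\}$, each database entry
$L_n = (\hat{l}_{p,n}, l_{1,n}, l_{2,n}, \hat{l}_{3,n})$ is associated to a particular centroid. The equation below
is used to compute the error function between the entry (input vector) and each centroid in
the codebook. The entry $L_n$ is associated to the centroid which gives the smallest error.
This step defines a partition on the input vectors.

$$\epsilon_m(L_n) = \sum_{j=1}^{10} w_{1,n}(j)\,\big[l_{1,n}(j) - a_{1,m}(j)\,\hat{l}_{p,n}(j) - (1 - a_{1,m}(j))\,\hat{l}_{3,n}(j)\big]^2 + \sum_{j=1}^{10} w_{2,n}(j)\,\big[l_{2,n}(j) - a_{2,m}(j)\,\hat{l}_{p,n}(j) - (1 - a_{2,m}(j))\,\hat{l}_{3,n}(j)\big]^2 \qquad (8)$$

3.6.2 Given a particular partition, the codebook is updated. Assume $N'$
database entries are associated to the centroid $A_m = (a_{1,m}, a_{2,m})$; then the centroid is
updated using the following equation:

$$a_{1,m}(j) = \frac{\sum_{n=0}^{N'-1} w_{1,n}(j)\,[\hat{l}_{3,n}(j) - l_{1,n}(j)]\,[\hat{l}_{3,n}(j) - \hat{l}_{p,n}(j)]}{\sum_{n=0}^{N'-1} w_{1,n}(j)\,[\hat{l}_{3,n}(j) - \hat{l}_{p,n}(j)]^2}$$

$$a_{2,m}(j) = \frac{\sum_{n=0}^{N'-1} w_{2,n}(j)\,[\hat{l}_{3,n}(j) - l_{2,n}(j)]\,[\hat{l}_{3,n}(j) - \hat{l}_{p,n}(j)]}{\sum_{n=0}^{N'-1} w_{2,n}(j)\,[\hat{l}_{3,n}(j) - \hat{l}_{p,n}(j)]^2} \qquad (9)$$

The interpolation coefficients codebook was trained and tested for several codebook
sizes. A codebook with 16 entries was found to be quite efficient. The above procedure
is readily understood by engineers familiar with the general concepts of vector
quantization and codebook design as described in [7].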
3.7 Gain Quantization 

In the 1.2 kbps coder, two gain parameters are calculated per frame, with 6 gains
per superframe. The 6 gain parameters are vector-quantized using a 10 bit vector
quantizer with a MSE criterion defined in the logarithmic domain.

3.8 Bandpass Voicing Quantization

The voicing information for the lowest band out of the total of 5 bands is 
determined from the U/V decision. The voicing decisions of the remaining 4 bands are 
employed only for voiced frames. The binary voicing decisions (1 for voiced and 0 for
unvoiced) of the 4 bands are quantized using the 2-bit codebook shown in Table 2. This
procedure results in two bits being used for voicing in each voiced frame. The bit 
allocation required in different coding modes for bandpass voicing quantization is 
shown in Table 6. 

3.9 Quantization of Fourier Magnitudes 

The Fourier magnitude vector is computed only for voiced frames. The
quantization procedure for Fourier magnitudes is summarized in Table 7. The
unquantized Fourier magnitude vectors for the three frames in a superframe are denoted
as $f_i$, $i = 1, 2, 3$. The Fourier magnitude vector of the last frame in the previous
superframe is denoted by $f_0$, $\hat{f}_i$ denotes the quantized vector of $f_i$, and Q(.)
denotes the quantizer function for the Fourier magnitude vector when using the same 8-bit
codebook as used within the MELP standard. The quantized Fourier magnitude vectors for
the three frames in a superframe are obtained as shown in Table 7.

3.10 Aperiodic flag quantization 

The 1.2 kbps coder uses 1 bit per superframe for the quantization of the
aperiodic flag. In the 2.4 kbps MELP standard, the aperiodic flag requires one bit per
frame, which is three bits per superframe. The compression to one bit per superframe is
obtained using the quantization procedure shown in Table 8. In the table, "J" and "-" 
indicate respectively the aperiodic flag states of set and not set. 

3.11 Error Protection 

3.11.1 Mode protection

Aside from the parity bit, additional mode error protection techniques are
applied to superframes by employing the spare bits that are available in all superframes
except the superframes in the VVV mode. The 1.2 kbps coder uses two bits for the
quantization of the bandpass voicing for each voiced frame. Hence, in superframes that
have one unvoiced frame, two bandpass voicing bits are spare and can be used for mode
protection. In superframes that have two unvoiced frames, four bits can be used for
mode protection. In addition, 4 bits of LSF quantization are used for mode protection in
the UUU and VVU modes. Table 9 shows how these mode protection bits are used.
Mode protection implies protection of the coding state, which was described in Section
1.1.

3.11.2 Forward Error Correction for UUU Superframe 

In the UUU mode, the first 8 MSB's of the gain index are divided into two
groups of 4 bits and each group is protected by the Hamming (8,4) code. The remaining
2 bits of the gain index are protected with the Hamming (7,4) code. Note that the
Hamming (7,4) code corrects single bit-errors, while the (8,4) code corrects single bit
errors and in addition detects double bit-errors. The LSF bits for each frame in the
UUU superframes are protected by a cyclic redundancy check (CRC) with a CRC (13,9)
code which detects single and double bit-errors.
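
For illustration, a Hamming (7,4) encoder and single-error-correcting decoder may be sketched as follows. The text does not specify the generator matrix or the bit ordering used by the coder, so the classical layout with parity bits at positions 1, 2 and 4 is assumed, and the function names are hypothetical.

```c
/* Codeword positions 1..7 are stored in bits 0..6 of the return value.
 * Positions 1, 2 and 4 carry parity; positions 3, 5, 6 and 7 carry data. */
unsigned hamming74_encode(unsigned data4)
{
    unsigned d0 = data4 & 1u, d1 = (data4 >> 1) & 1u;
    unsigned d2 = (data4 >> 2) & 1u, d3 = (data4 >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;   /* checks positions 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* checks positions 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* checks positions 5, 6, 7 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* Corrects at most one bit error and returns the four data bits. */
unsigned hamming74_decode(unsigned code7)
{
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((code7 >> (pos - 1)) & 1u)
            syndrome ^= pos;      /* XOR of the positions of all set bits */
    if (syndrome != 0)
        code7 ^= 1u << (syndrome - 1);   /* syndrome is the error position */
    return ((code7 >> 2) & 1u) | (((code7 >> 4) & 1u) << 1) |
           (((code7 >> 5) & 1u) << 2) | (((code7 >> 6) & 1u) << 3);
}
```

An (8,4) code can be obtained from this layout by appending one overall parity bit, which is what allows double errors to be detected in addition to single errors being corrected.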
4. DECODER 

4.1 Bit Unpacking and Error Correction 

Within the decoder, the received bits are unpacked from the channel and
assembled into parameter codewords. Since the decoding procedures for most
parameters depend on the mode (the U/V pattern), the 12 bits allocated for pitch and
U/V decisions are decoded first. For the bit pattern 000 in the 3-bit codebook, the 9-bit
codeword specifies one of the UUU, UUV, UVU, and VUU modes. If the code of the
9-bit codebook is all-zeros, or has one bit set, the UUU mode is used. If the code has
two bits set, or specifies an index unused for pitch, a frame erasure is indicated.
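
The handling of the 9-bit codeword when the 3-bit codeword is 000 can be sketched as below. The enumeration, the function names, and the `index_unused_for_pitch` flag are assumptions of this example. The flag stands in for a lookup in Table 4, which is not reproduced here.

```c
typedef enum { MODE_UUU, MODE_ONE_VOICED, MODE_ERASURE } low_mode_t;

/* Number of set bits among the low nine bits of w. */
static int popcount9(unsigned w)
{
    int n = 0;
    for (int b = 0; b < 9; b++)
        n += (int)((w >> b) & 1u);
    return n;
}

/* word9: received 9-bit codeword (3-bit codeword already known to be 000).
 * index_unused_for_pitch: nonzero if Table 4 assigns no pitch to word9. */
low_mode_t classify_word9(unsigned word9, int index_unused_for_pitch)
{
    int ones = popcount9(word9);
    if (ones <= 1)
        return MODE_UUU;          /* all-zeros, or a single bit error */
    if (ones == 2 || index_unused_for_pitch)
        return MODE_ERASURE;
    return MODE_ONE_VOICED;       /* one of UUV, UVU, VUU per Table 4 */
}
```

Treating a weight-one word as UUU means that a single bit error on the all-zero UUU codeword is absorbed, while a weight-two word is flagged rather than decoded as a pitch.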

After decoding the U/V pattern, the resulting mode information is checked using
the parity bit and the mode protection bits. If an error is detected, a mode correction
algorithm is performed. The algorithm attempts to correct the mode error using the
parity bit and mode protection bits. In the case that an uncorrectable error is detected,
different decoding methods are applied for each parameter according to the mode error
patterns. In addition, if a parity error is found, a parameter-smoothing flag is set. The
correction procedures are described in Table 10.

In the UUU mode, assuming no errors were detected in the mode information,
the two (8,4) Hamming codes representing the gain parameters are decoded to correct
single bit errors and detect double errors. If an uncorrectable error is detected, a frame
erasure is indicated. Otherwise the (7,4) Hamming code for gain and the (13,9) CRC
(cyclic redundancy check) codes for LSF's are decoded to correct single errors and
detect single and double errors, respectively. If an error is found in the CRC (13,9)
codes, the incorrect LSF's are replaced by repeating previous LSF's or interpolating
between the neighboring correct LSF's.

If a frame erasure is detected in the current superframe by the Hamming
decoder, or an erasure is directly signaled from the channel, a frame repeat mechanism
is implemented. All the parameters of the current superframe are replaced with the
parameters from the last frame of the previous superframe.

For a superframe in which an erasure is not detected, the remaining parameters
are decoded. If smoothing is necessary, the post-smoothing parameter is obtained by:

$$\tilde{x} = 0.5\,x + 0.5\,x' \qquad (10)$$

where $x$ and $x'$ represent the decoded parameter of the current frame and the
corresponding parameter of the previous frame, respectively.
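
As a brief illustration of equation (10), the smoothing can be applied conditionally on the parameter-smoothing flag. The function name and the flag argument are assumptions of this sketch.

```c
/* Returns the average of the current and previous decoded parameter when
 * the smoothing flag is set, and the current parameter otherwise. */
double smooth_parameter(double current, double previous, int smoothing_flag)
{
    return smoothing_flag ? 0.5 * current + 0.5 * previous : current;
}
```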

4.2 Pitch Decoding 

The pitch decoding is performed as shown in Table 4. For unvoiced frames, the 
pitch value is set to 50 samples. 

4.3 LSF Decoding 

LSF's are decoded as described in Section 4.4 and Table 5. The LSF's are
checked for ascending order and minimum separation.

4.4 Gain decoding 

The gain index is used to retrieve a codeword containing six gain parameters
from the 10-bit VQ gain codebook.

4.5 Decoding of Bandpass Voicing 

In the unvoiced frames, all of the bandpass voicing strengths are set to zero. In
the voiced frames, $Vbp_1$ is set to 1 and the remaining voicing patterns are decoded as
shown in Table 2.

4.6 Decoding of Fourier Magnitudes 

The Fourier magnitudes of unvoiced frames are set equal to 1. For the last 
voiced frame of the current superframe, the Fourier magnitudes are decoded directly. 
The Fourier magnitudes of other voiced frames are generated by repetition or linear 
interpolation as shown in Table 7. 

4.7 Aperiodic Flag Decoding 

The aperiodic flags are obtained from the new flag as shown in Table 8. The 
jitter is set to 25% if the aperiodic flag is 1, otherwise the jitter is set to 0%. 

4.8 MELP Synthesis 

The basic structure of the decoder is the same as in the MELP standard except 
that a new harmonic synthesis method is introduced to generate the excitation signal for 
each pitch cycle. In the original 2.4 kbps MELP algorithm, the mixed excitation is 
generated as the sum of the filtered pulse and noise excitations. The pulse excitation is 
computed using an inverse discrete Fourier transform (IDFT) of one pitch period in 
length and the noise excitation is generated in the time domain. In the new harmonic 
synthesis algorithm, the mixed excitation is generated completely in the frequency 
domain and then an inverse discrete Fourier transform operation is performed to convert 
it into the time domain. This avoids the need for bandpass filtering of the pulse and 
noise excitations, thereby reducing complexity of the decoder. 

In the new harmonic synthesis procedure, the excitation in the frequency domain
is generated for each pitch cycle based on the cutoff frequency and the Fourier
magnitude vector $A_l$, $l = 1, 2, \ldots, L$. The cutoff frequency is obtained from the bandpass
voicing parameters as previously described and it is then interpolated for each pitch
cycle. The Fourier magnitudes are interpolated in the same way as in the MELP
standard.

With the pitch length denoted as $N$, the corresponding fundamental frequency is
described by: $f_0 = 2\pi/N$. The Fourier magnitude vector length is then given by: $L = N/2$.
Two transition frequencies $F_H$ and $F_L$ are determined from the cutoff frequency $F$
employing an empirically derived algorithm as follows:



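$$F_H = \begin{cases} 0.85F & 0\ \text{Hz} < F \le 500\ \text{Hz} \\ 0.95F & 500\ \text{Hz} < F \le 1000\ \text{Hz} \\ 0.98F & 1000\ \text{Hz} < F \le 2000\ \text{Hz} \\ 0.95F & 2000\ \text{Hz} < F \le 3000\ \text{Hz} \\ 0.92F & 3000\ \text{Hz} < F \le 4000\ \text{Hz} \end{cases}$$

$$F_L = \begin{cases} 1.05F & 0\ \text{Hz} < F \le 500\ \text{Hz} \\ 1.05F & 500\ \text{Hz} < F \le 1000\ \text{Hz} \\ 1.02F & 1000\ \text{Hz} < F \le 2000\ \text{Hz} \\ 1.05F & 2000\ \text{Hz} < F \le 3000\ \text{Hz} \\ 1.00F & 3000\ \text{Hz} < F \le 4000\ \text{Hz} \end{cases}$$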



These transition frequencies are equivalent to two frequency component indices $V_H$ and
$V_L$. A voiced model is used for all the frequency samples below $V_L$, a mixed model is
used for frequency samples between $V_L$ and $V_H$, and an unvoiced model is used for
frequency samples above $V_H$. To define the mixed mode, a gain factor $g$ is selected
with the value depending on the cutoff frequency (the higher the cutoff frequency $F$, the
smaller the gain factor).

$$g = \begin{cases} 1.0 & 0\ \text{Hz} < F \le 500\ \text{Hz} \\ 0.9 & 500\ \text{Hz} < F \le 1000\ \text{Hz} \\ 0.8 & 1000\ \text{Hz} < F \le 2000\ \text{Hz} \\ 0.75 & 2000\ \text{Hz} < F \le 3000\ \text{Hz} \\ 0.7 & 3000\ \text{Hz} < F \le 4000\ \text{Hz} \end{cases}$$

The magnitude and phase of the frequency components of the excitation are
determined as follows:

$$|X(l)| = \begin{cases} A_l & l < V_L \\ \dfrac{l - V_L}{V_H - V_L}\, g\, A_l + \dfrac{V_H - l}{V_H - V_L}\, A_l & V_L \le l \le V_H \\ g\, A_l & l > V_H \end{cases} \qquad (11)$$

$$\angle X(l) = \begin{cases} l\,\phi_0 & l < V_L \\ l\,\phi_0 + \dfrac{l - V_L}{V_H - V_L}\, \phi_{RND}(l) & V_L \le l \le V_H \\ \phi_{RND}(l) & l > V_H \end{cases} \qquad (12)$$

where $l$ is an index identifying a particular frequency component of the IDFT frequency
range and $\phi_0$ is a constant selected so as to avoid a pitch pulse at the pitch cycle
boundary. The phase $\phi_{RND}(l)$ is a uniformly distributed random number between $-2\pi$
and $2\pi$ independently generated for each value of $l$.
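
By way of illustration only, the gain factor selection and the three-region magnitude weighting may be sketched in C as follows. The function names, the 0-based array indexing, and the treatment of the band edges are assumptions of this sketch. The phase computation and the inverse transform are omitted.

```c
#include <stddef.h>

/* Gain factor g as a function of the cutoff frequency in Hz. */
double mixed_gain(double cutoff_hz)
{
    if (cutoff_hz <= 500.0)  return 1.0;
    if (cutoff_hz <= 1000.0) return 0.9;
    if (cutoff_hz <= 2000.0) return 0.8;
    if (cutoff_hz <= 3000.0) return 0.75;
    return 0.7;
}

/* Scales the magnitudes a[0..len-1] in place. Components below v_lo are
 * left unchanged, components above v_hi are multiplied by g, and the
 * weight falls linearly from 1 to g in between. Requires v_lo <= v_hi. */
void shape_magnitudes(double *a, size_t len, size_t v_lo, size_t v_hi, double g)
{
    for (size_t l = 0; l < len; l++) {
        double w;
        if (l < v_lo) {
            w = 1.0;
        } else if (l > v_hi || v_hi == v_lo) {
            w = g;
        } else {
            double t = (double)(l - v_lo) / (double)(v_hi - v_lo);
            w = (1.0 - t) + t * g;
        }
        a[l] *= w;
    }
}
```

With six unit magnitudes, a transition from index 1 to index 3 and g = 0.5, the weights are 1, 1, 0.75, 0.5, 0.5 and 0.5.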

In other words, the spectrum of the mixed excitation signal in each pitch period
is modeled by considering three regions of the spectrum, as determined by the cutoff
frequency, which determines a transition interval from $F_L$ to $F_H$. In the low region, from
0 to $F_L$, the Fourier magnitudes directly determine the spectrum. In the high region,
above $F_H$, the Fourier magnitudes are scaled down by the gain factor $g$. In the transition
region, from $F_L$ to $F_H$, the Fourier magnitudes are scaled by a linearly decreasing
weighting factor that drops from unity to $g$ across the transition region. A linearly
increasing phase is used for the low region, and random phases are used for the high
region. In the transition region, the phase is the sum of the linear phase and a weighted
random phase with the weight increasing linearly from 0 to 1 across the transition
region. The frequency samples of the mixed excitation are then converted to the time
domain using an inverse Discrete Fourier Transform.

5. TRANSCODER
5.1 Concepts 

In some applications, it is important to allow interoperation between two 
different speech coding schemes. In particular, it is useful to allow interoperability 
between a 2400 bps MELP coder and a 1200 bps superframe coder. The general 
operation of a transcoder is illustrated in the block diagrams of Figures 5A and 5B. In
the up-converting transcoder 70 of FIG. 5A, speech is input 72 to a 1200 bps vocoder 74
whose output is an encoded bit stream at 1200 bps 76 which is converted by the
"Up-Transcoder" 78 into a 2400 bps bit stream 80 in a form allowing it to be decoded by a
2400 bps MELP decoder 82, that outputs synthesized speech 84. Conversely, in the
down-converting transcoder 90 of FIG. 5B, speech is input 92 to a 2400 bps MELP
encoder 94, which outputs a 2400 bps bit stream 96 into a "Down-Transcoder" 98, that 
converts the parametric data stream into a 1200 bps bit stream 100 that can be decoded 
by the 1200 bps decoder 102, that outputs synthesized speech 104. In full-duplex (two- 
way) voice communication both the up-transcoder and the down-transcoder are needed 
to provide interoperability. 

A simple way to implement an up-transcoder is to decode the 1200 bps bit 
stream with a 1200 bps decoder to obtain a raw digital representation of the recovered 
speech signal which is then re-encoded with a 2400 bps encoder. Similarly, a simple 
method for implementing a down-transcoder is to decode the 2400 bps bit stream with a 
2400 bps decoder to obtain a raw digital representation of the recovered speech signal 
which is then re-encoded with a 1200 bps encoder. This approach to implementing up
and down transcoders corresponds to what is called "tandem" encoding and has the
disadvantages that the voice quality is substantially degraded and the complexity of the
transcoder is unnecessarily high. Transcoder efficiency is improved with the following
method for transcoding that reduces complexity while avoiding much of the quality
degradation associated with tandem encoding.

5.2 Down-Transcoder

In the down-transcoder, after synchronization and channel error correction
decoding are performed, the bits representing each parameter are separately extracted
from the bit stream for each of three consecutive frames (constituting a superframe) and
the set of parameter information is stored in a parameter buffer. Each parameter set
consists of the values of a given parameter for the three consecutive frames. The same
methods used to quantize superframe parameters are applied here to each parameter set
for recoding into the lower-rate bit stream. For example, the pitch and U/V decision for
each of 3 frames in a superframe is applied to the pitch and U/V quantization scheme
described in Section 3.2. In this case, the parameter set consists of 3 pitch values each
represented with 7 bits and 3 U/V decisions each given by 1 bit, giving a total of 24 bits.
This is extracted from the 2400 bps bit stream and the recoding operation converts this
into 12 bits to represent the pitch and voicing for the superframe. In this way, the
down-transcoder does not have to perform the MELP analysis functions and only
performs the needed quantization operations for the superframe. Note that the parity
check bit, synchronization bit, and error correction bits must be regenerated as part of
the down transcoding operation.

5.3 Up-Transcoder 

In the case of an up-transcoder, the input bit stream of 1200 bps contains
quantized parameters for each superframe. After synchronization and error correction
decoding are performed, the up-transcoder extracts the bits representing each parameter
for the superframe, which are mapped (recoded) into a larger number of bits that specify
separately the corresponding values of that parameter for each of the three frames in the
current superframe. The method of performing this mapping, which is parameter
dependent, is described below. Once all parameters for a frame of the superframe have
been determined, the sequence of bits representing three frames of speech is generated.
From this data sequence, the 2400 bps bit stream is generated, after insertion of the
synchronization bit, parity bit, and error correction encoding.

The following is a description of the general approach to mapping (decoding)
the parameter bits for a superframe into separate parameter bits for each of the three
frames. Quantization tables and codebooks are used in the 1200 bps decoder for each
parameter as described previously. The decoding operation takes a binary word that
represents one or more parameters and outputs a value for each parameter, e.g. a
particular LSF value or pitch value as stored in a codebook. The parameter values are
requantized, i.e. applied as input to a new quantizing operation employing the
quantization tables of the 2400 bps MELP coder. This requantization leads to a new
binary word that represents the parameter values in a form suitable for decoding by the
2400 bps MELP decoder.

As an example to illustrate the use of requantization, from the 1200 bps bit
stream, the bits containing the pitch and voicing information for a particular superframe
are extracted and decoded into 3 voicing (V/U) decisions and 3 pitch values for the 3
frames in the superframe. The 3 voicing decisions are binary and are directly usable as
the voicing bits for the 2400 bps MELP bitstream (one bit for each of 3 frames). The 3
pitch values are requantized by applying each to the MELP pitch scalar quantizer,
obtaining a 7 bit word for each pitch value. Numerous alternative implementations of
pitch requantization which follow the inventive method described can be designed by a
person skilled in the art.

One specific alternative is to bypass pitch requantization when
only a single frame of the superframe is voiced, since in this case the pitch value for the
voiced frame is already specified in quantized form consistent with the format of the
MELP vocoder. Similarly, for the Fourier magnitudes, requantization is not needed for
the last frame of a superframe since it has already been scalar quantized in the MELP
format. However, the interpolated Fourier magnitudes for the other two frames of the
superframe need to be requantized by the MELP quantization scheme. The jitter, or
aperiodic flag, is simply obtained by table lookup using the last two columns of Table 8.
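
The pitch requantization step can be illustrated with a short sketch. It assumes a log-domain uniform scalar quantizer with 99 levels (the level count is the one cited in Table 4) over a pitch range of 20 to 160 samples. The range, the log-domain spacing, and the function names are assumptions made for illustration, and the assignment of level indices to the actual 7-bit MELP codewords is not reproduced.

```python
import math

LEVELS = 99            # 99-level uniform quantizer, as cited in Table 4
PITCH_MIN = 20.0       # assumed lower pitch limit, in samples
PITCH_MAX = 160.0      # assumed upper pitch limit, in samples
LOG_MIN = math.log10(PITCH_MIN)
LOG_MAX = math.log10(PITCH_MAX)
STEP = (LOG_MAX - LOG_MIN) / (LEVELS - 1)


def requantize_pitch(pitch: float) -> int:
    """Index of the nearest level on the log-uniform grid (fits in 7 bits)."""
    log_pitch = min(max(math.log10(pitch), LOG_MIN), LOG_MAX)
    return round((log_pitch - LOG_MIN) / STEP)


def level_value(index: int) -> float:
    """Pitch value, in samples, represented by a level index."""
    return 10.0 ** (LOG_MIN + index * STEP)


def up_transcode_pitch(voicing, pitches):
    """Per-frame (voicing bit, pitch index or None) for one superframe."""
    return [(int(v), requantize_pitch(p) if v else None)
            for v, p in zip(voicing, pitches)]


# Hypothetical decoded superframe: frames 1 and 3 voiced, frame 2 unvoiced.
print(up_transcode_pitch([True, False, True], [57.3, 0.0, 61.0]))
```

The voicing decisions pass through unchanged, while each voiced pitch value is mapped to the nearest level of the target quantizer.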

6. DIGITAL VOCODER TERMINAL HARDWARE

FIG. 6 shows a digital vocoder terminal containing an encoder and decoder that
operate in accordance with the voice coding methods and apparatus of this invention.
The microphone MIC 112 is an input speech transducer providing an analog output
signal 114 which is sampled and digitized by an Analog to Digital Converter (A/D) 116.
The resulting sampled and digitized speech 118 is digitally processed and compressed
within a DSP/controller chip 120, by the voice encoding operations performed in the
Encode block 122, which is implemented in software within the DSP/Controller
according to the invention.

The digital signal processor (DSP) 120 is exemplified by the Texas Instruments
TMS320C5416 integrated circuit, which contains random access memory (RAM)
providing sufficient buffer space for storing speech data and intermediate data and
parameters; the DSP circuit also contains read-only memory (ROM) for containing the
program instructions, as previously described, to implement the vocoder operations. A
DSP is well suited for performing the vocoder operations described in this invention.
The bit stream 124 resulting from the encoding operation is a low-rate bit stream, the Tx
data stream. The Tx data 124 enters a Channel Interface Unit 126 to be transmitted
over a channel 128.

On the receiving side, data from a channel 128 enters a Channel Interface Unit
126 which outputs an Rx bit-stream 130. The Rx data 130 is applied to a set of voice
decoding operations within the decode block; the operations have been previously
described. The resulting sampled and digitized speech 134 is applied to a Digital to
Analog Converter (D/A) 136. The D/A outputs reconstructed analog speech 138. The
reconstructed analog speech 138 is applied to a speaker 140, or other audio transducer
which reproduces the reconstructed sound.

FIG. 6 is a representation of one configuration of hardware on which the 
inventive principles may be practiced. The inventive principles may be practiced on 
various forms of vocoder implementations that can support the processing functions 
described herein for the encoding and decoding of the speech data. Specifically the
following are but a few of the many variations included within the scope of the 
inventive implementation: 

(a) Using Channel Interface Units which contain a voiceband data modem 
for use when the transmission path is a conventional telephone line. 

(b) Using digital signals that are encrypted for transmission and decrypted on
reception via a suitable encryption device to provide secure transmission. In this case,
the encryption unit would also be contained in the Channel Interface Unit.

(c) Using a Channel Interface Unit that contains a radio frequency 
modulator and demodulator for wireless signal transmission by radio waves for cases in 
which the transmission channel is a wireless radio link. 

(d) Using a Channel Interface Unit that contains multiplexing and 
demultiplexing equipment for sharing a common transmission channel with multiple 
voice and/or data channels. In this case multiple Tx and Rx signals would be connected 
to the Channel Interface Unit. 

(e) Employing discrete components, or a mix of discrete elements and 
processing elements, to replace the instruction processing operations of the 
DSP/Controller. Examples that could be employed include programmable gate arrays 
(PGAs). It must be noted that the invention can be fully reduced to practice in 
hardware, without the need of a processing element. 

Hardware to support the inventive principles need only support the data
operations described. However, DSP/processor chips are the most common
circuits used for implementing speech coders or vocoders in the current state of the art.

Although the description above contains many specificities, these should not be 
construed as limiting the scope of the invention but as merely providing illustrations of 
some of the presently preferred embodiments of this invention. Thus the scope of this 
invention should be determined by the appended claims and their legal equivalents. 

Table 1. Bit Allocation of both 2.4 kbps and 1.2 kbps Coding Schemes

Bits for quantization of three frames (540 samples)

Parameters                    2.4 kbps  2.4 kbps  1.2 kbps  1.2 kbps  1.2 kbps  1.2 kbps  1.2 kbps
                              Voiced    Unvoiced  state 1   state 2   state 3   state 4   state 5
Pitch & Global UV Decisions   7*3       7*3       12        12        12        12        12
Parity                        0         0         1         1         1         1         1
LSF's                         25*3      25*3      42        42        39        42        27
Gains                         8*3       8*3       10        10        10        10        10
Bandpass Voicing              4*3       0         6         4         4         2         0
Fourier Magnitudes            8*3       0         8         8         8         8         0
Jitter                        1*3       0         1         1         1         1         0
Synchronization               1*3       1*3       1         1         1         1         1
Error Protection              0         13*3      0         2         5         4         30
Total                         162       162       81        81        81        81        81

*Note:
1.2 kbps State 1: All three frames are voiced.
1.2 kbps State 2: One of the first two frames is unvoiced, other frames are voiced.
1.2 kbps State 3: The 1st and 2nd frames are voiced. The 3rd frame is unvoiced.
1.2 kbps State 4: One of the three frames is voiced, other two frames are unvoiced.
1.2 kbps State 5: All three frames are unvoiced.
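
As a quick consistency check on Table 1, the column totals and the resulting bit rates can be recomputed. The sketch below copies the table entries and assumes a sampling rate of 8000 Hz, which is not stated in the table itself; the 540 samples per superframe are taken from the table heading.

```python
SAMPLE_RATE_HZ = 8000        # assumption: not stated in Table 1
SUPERFRAME_SAMPLES = 540     # from the table heading: three frames, 540 samples

SCHEMES = ["2.4 kbps voiced", "2.4 kbps unvoiced",
           "state 1", "state 2", "state 3", "state 4", "state 5"]

# Bits per three frames, one list per table row, in the column order above.
TABLE_1 = {
    "pitch and global UV": [7 * 3, 7 * 3, 12, 12, 12, 12, 12],
    "parity":              [0, 0, 1, 1, 1, 1, 1],
    "LSFs":                [25 * 3, 25 * 3, 42, 42, 39, 42, 27],
    "gains":               [8 * 3, 8 * 3, 10, 10, 10, 10, 10],
    "bandpass voicing":    [4 * 3, 0, 6, 4, 4, 2, 0],
    "Fourier magnitudes":  [8 * 3, 0, 8, 8, 8, 8, 0],
    "jitter":              [1 * 3, 0, 1, 1, 1, 1, 0],
    "synchronization":     [1 * 3, 1 * 3, 1, 1, 1, 1, 1],
    "error protection":    [0, 13 * 3, 0, 2, 5, 4, 30],
}

# Sum each column, then convert bits per superframe into bits per second.
totals = [sum(column) for column in zip(*TABLE_1.values())]
bit_rates = [total * SAMPLE_RATE_HZ // SUPERFRAME_SAMPLES for total in totals]

for scheme, total, rate in zip(SCHEMES, totals, bit_rates):
    print(f"{scheme}: {total} bits per superframe, {rate} bps")
```

Under the assumed sampling rate, 540 samples last 67.5 ms, so 162 bits correspond to 2400 bps and 81 bits to 1200 bps.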

Table 2. Bandpass voicing index mapping

Codeword:              0000      1000      1100      1111

Voicing patterns       0000      1000      1100      0111
assigned to the        0001      1001                1011
codeword:              0010      1010                1101
                       0011                          1110
                       0100                          1111
                       0101
                       0110

Cutoff Frequency:      500 Hz    1000 Hz   2000 Hz   4000 Hz
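
The mapping in Table 2 amounts to a lookup from each 4-bit bandpass voicing pattern to one of four codewords, each tied to a cutoff frequency. The sketch below encodes one reading of the table columns; the grouping of patterns under each codeword is reconstructed from the table layout, and the function name is hypothetical.

```python
from typing import Tuple

# Codeword -> voicing patterns assigned to it (one reading of Table 2).
CODEWORD_PATTERNS = {
    "0000": ["0000", "0001", "0010", "0011", "0100", "0101", "0110"],
    "1000": ["1000", "1001", "1010"],
    "1100": ["1100"],
    "1111": ["0111", "1011", "1101", "1110", "1111"],
}
# Codeword -> cutoff frequency in Hz (last row of Table 2).
CUTOFF_HZ = {"0000": 500, "1000": 1000, "1100": 2000, "1111": 4000}

# Invert the table so that every 4-bit pattern points at its codeword.
PATTERN_TO_CODEWORD = {
    pattern: codeword
    for codeword, patterns in CODEWORD_PATTERNS.items()
    for pattern in patterns
}


def map_voicing_pattern(pattern: str) -> Tuple[str, int]:
    """Return (codeword, cutoff frequency in Hz) for a 4-bit voicing pattern."""
    codeword = PATTERN_TO_CODEWORD[pattern]
    return codeword, CUTOFF_HZ[codeword]


print(map_voicing_pattern("1010"))
```

With this grouping, all 16 possible patterns are covered exactly once.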

Table 3. Pitch quantization schemes

U/V pattern        Pitch quantization method
U U U              N/A
U U V, U V U,      The pitch of the only voiced frame is scalar quantized using a
V U U              7-bit quantizer.
U V V, V U V,      The pitches of the voiced frames are quantized using the same VQ
V V U              as for the VVV case. A weighting function is applied which takes
                   into account the U/V information.
V V V              Vector quantization of three pitches

Table 4. Joint quantization scheme of pitch and voicing decisions

U/V patterns          3-bit codewords   9-bit codebooks
UUU, UUV, UVU, VUU    000               The pitch value is quantized with the same 99-level
                                        uniform quantizer as in the 2.4kbps standard. The
                                        pitch value and U/V pattern are then mapped to a
                                        codevector in this 9-bit codebook.
VVU                   001               These U/V patterns share the same codebook
VUV                   010               containing 512 codevectors of the pitch triple.
UVV                   100
VVV                   011               512-entry codebook A
                      101               512-entry codebook B
                      110               512-entry codebook C
                      111               512-entry codebook D
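
To make the 3-bit plus 9-bit structure of Table 4 concrete, the sketch below splits a 12-bit pitch and voicing word into its codeword and codebook index. Placing the 3-bit codeword in the most significant bits is an assumption made for illustration, as is the function name; the table itself does not fix the bit order.

```python
from typing import Tuple

# 3-bit codeword -> U/V patterns it can represent, as listed in Table 4.
CODEWORD_TO_PATTERNS = {
    0b000: ("UUU", "UUV", "UVU", "VUU"),
    0b001: ("VVU",),
    0b010: ("VUV",),
    0b100: ("UVV",),
    0b011: ("VVV",),
    0b101: ("VVV",),
    0b110: ("VVV",),
    0b111: ("VVV",),
}
INDEX_BITS = 9


def split_pitch_voicing_word(word: int) -> Tuple[int, int, Tuple[str, ...]]:
    """Split a 12-bit word into (3-bit codeword, 9-bit index, candidate patterns)."""
    if not 0 <= word < 1 << (3 + INDEX_BITS):
        raise ValueError("expected a 12-bit value")
    codeword = word >> INDEX_BITS                 # assumed: codeword in the top 3 bits
    index = word & ((1 << INDEX_BITS) - 1)        # remaining 9 bits select a codevector
    return codeword, index, CODEWORD_TO_PATTERNS[codeword]


# Number of pitch triples available when all three frames are voiced.
vvv_entries = sum(
    1 for patterns in CODEWORD_TO_PATTERNS.values() if patterns == ("VVV",)
) * (1 << INDEX_BITS)

print(split_pitch_voicing_word(0b001000000101))
print(vvv_entries)
```

Because four of the eight codewords are reserved for the all-voiced case, that case effectively has four 512-entry codebooks at its disposal.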



Table 5. Bit allocation for LSF quantization according to UV decisions

U/V pattern      LSF 1     LSF 2     LSF 3     Interpolation   Residual of       Total
                                                               LSF 1 and LSF 2
UUU              9         9         9         0               0                 27
VUU              8+6+5+5   9         9         0               0                 42
UVU              9         8+6+5+5   9         0               0                 42
UUV              9         9         8+6+5+5   0               0                 42
UVV, VUV, VVV    0         0         8+6+5+5   4               8+6               42
VVU              0         0         9         4               8+6+6+6           39
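
The totals in Table 5 can be cross-checked against the LSF row of Table 1 by classifying each U/V pattern into its 1.2 kbps coding state. The sketch below is illustrative only; the helper names are hypothetical, and the state rules follow the note under Table 1.

```python
# Stage bit counts per field, read from Table 5
# (fields: LSF 1, LSF 2, LSF 3, interpolation, residual).
VOICED_STAGES = [8, 6, 5, 5]
INTERPOLATED = ([0], [0], VOICED_STAGES, [4], [8, 6])
LSF_BITS = {
    "UUU": ([9], [9], [9], [0], [0]),
    "VUU": (VOICED_STAGES, [9], [9], [0], [0]),
    "UVU": ([9], VOICED_STAGES, [9], [0], [0]),
    "UUV": ([9], [9], VOICED_STAGES, [0], [0]),
    "UVV": INTERPOLATED,
    "VUV": INTERPOLATED,
    "VVV": INTERPOLATED,
    "VVU": ([0], [0], [9], [4], [8, 6, 6, 6]),
}

# LSF row of Table 1, indexed by 1.2 kbps coding state.
TABLE_1_LSF_BITS = {1: 42, 2: 42, 3: 39, 4: 42, 5: 27}


def coding_state(pattern: str) -> int:
    """Coding state of a U/V pattern, following the note under Table 1."""
    voiced = pattern.count("V")
    if voiced == 3:
        return 1
    if voiced == 2:
        return 3 if pattern == "VVU" else 2
    if voiced == 1:
        return 4
    return 5


def lsf_total(pattern: str) -> int:
    """Total LSF bits for a U/V pattern according to Table 5."""
    return sum(sum(field) for field in LSF_BITS[pattern])


for pattern in LSF_BITS:
    print(pattern, coding_state(pattern), lsf_total(pattern))
```

Every pattern's total agrees with the LSF allocation of its coding state, which is a useful sanity check when transcribing the tables.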



Table 6. Bit Allocation for bandpass voicing quantization

UV decisions pattern                    VVV    VVU, VUV, UVV    VUU, UVU, UUV    UUU
Bits for bandpass voicing information   6      4                2                0

Table 7. Fourier magnitude vector quantization

U/V pattern for        U/V decision for the last frame of the previous superframe
current superframe     U                               V
UUU                    N/A
VUU                    f1 = Q(f1)
UVU                    f2 = Q(f2)
UUV                    f3 = Q(f3)
UVV                    f3 = Q(f3), f2 = f3
VUV                    f3 = Q(f3), f1 = f3             f3 = Q(f3), f1 = f0
VVU                    f2 = Q(f2), f1 = f2             f2 = Q(f2), f1 = (f0 + f2)/2
VVV                    f3 = Q(f3), f1 = f2 = f3        f3 = Q(f3), f1 = (2 f0 + f3)/3,
                                                       f2 = (f0 + 2 f3)/3

Rows with a single entry use that entry for both columns.



Table 8. Aperiodic flag quantization using 1 bit

U/V pattern   Quantization Procedure                    Quantization Patterns
                                                        New flag = 0   New flag = 1
UUU           N/A                                       J J J          J J J
UUV           If the voiced frame has aperiodic flag,   J J -          J J J
UVU           set new flag.                             J - J          J J J
VUU                                                     - J J          J J J
UVV           If the second frame has aperiodic flag,   J - -          J J -
VVU           set new flag.                             - - J          - J J
VUV           N/A                                       - J -          - J -
VVV           If > 1 frame has the aperiodic flag set,  - - -          J J J
              set new flag.
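
The 1-bit aperiodic flag scheme of Table 8 can be sketched as a pair of small functions: one applying the quantization procedure, the other performing the table lookup on the last two columns. The function names are hypothetical. Two points are assumptions rather than statements of the table: the VVV entry for new flag = 0 is taken to mean no aperiodic flag on any frame, and patterns marked N/A are encoded as 0.

```python
from typing import List

# U/V pattern -> per-frame flags for (new flag = 0, new flag = 1); "J" = flag set.
JITTER_PATTERNS = {
    "UUU": ("JJJ", "JJJ"),
    "UUV": ("JJ-", "JJJ"),
    "UVU": ("J-J", "JJJ"),
    "VUU": ("-JJ", "JJJ"),
    "UVV": ("J--", "JJ-"),
    "VVU": ("--J", "-JJ"),
    "VUV": ("-J-", "-J-"),
    "VVV": ("---", "JJJ"),   # flag-0 entry assumed: no frame is aperiodic
}


def decode_aperiodic(uv_pattern: str, new_flag: int) -> List[bool]:
    """Per-frame aperiodic flags obtained by table lookup."""
    return [c == "J" for c in JITTER_PATTERNS[uv_pattern][new_flag]]


def encode_aperiodic(uv_pattern: str, flags: List[bool]) -> int:
    """Single superframe bit derived from three per-frame aperiodic flags."""
    if uv_pattern in ("UUU", "VUV"):
        return 0                                  # N/A in Table 8; 0 is assumed here
    voiced = [i for i, c in enumerate(uv_pattern) if c == "V"]
    if len(voiced) == 1:
        return int(flags[voiced[0]])              # flag of the only voiced frame
    if len(voiced) == 2:
        return int(flags[1])                      # UVV, VVU: flag of the second frame
    return int(sum(flags) > 1)                    # VVV: more than one frame flagged


print(encode_aperiodic("VVV", [True, True, False]))
print(decode_aperiodic("UVV", 1))
```

Encoding the decoded pattern returns the same bit for every pattern that carries a flag, so the two functions are mutually consistent under the stated assumptions.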

Table 9. Mode protection schemes

U/V pattern   3-bit codebook of joint   Bit pattern of       Bit pattern of       Bit pattern of
              quantization for pitch    bandpass voicing 1   bandpass voicing 2   LSF
              and U/V decisions
U U U         000                       00                   00                   0000
U U V         000                       00                   01
U V U         000                       00                   10
V U U         000                       00                   11
V V U         001                       01                                        0101
V U V         010                       10
U V V         100                       11
V V V         011, 101, 110, 111
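
One possible reading of Table 9 is that the bandpass voicing fields left unused in a given mode carry fixed bit patterns, so that a mismatch between the 3-bit codeword and those patterns reveals a mode error. The sketch below implements that reading only. The detection rule, the function name, and the use of `None` to signal an error are assumptions, and the LSF bit pattern column is not checked.

```python
from typing import Optional

# Bandpass voicing 1 pattern expected for each protected codeword (Table 9).
EXPECTED_BPV1 = {"000": "00", "001": "01", "010": "10", "100": "11"}
# For codeword 000, the bandpass voicing 2 pattern selects the U/V pattern.
BPV2_TO_PATTERN = {"00": "UUU", "01": "UUV", "10": "UVU", "11": "VUU"}
PROTECTED_PATTERN = {"001": "VVU", "010": "VUV", "100": "UVV"}
VVV_CODEWORDS = {"011", "101", "110", "111"}


def decode_mode(codeword: str, bpv1: str, bpv2: str = "00") -> Optional[str]:
    """Return the U/V pattern, or None when the fields disagree with Table 9."""
    if codeword in VVV_CODEWORDS:
        return "VVV"                      # no check pattern is listed for VVV
    if codeword not in EXPECTED_BPV1:
        raise ValueError("codeword must be a 3-bit string")
    if bpv1 != EXPECTED_BPV1[codeword]:
        return None                       # treated here as a detected mode error
    if codeword == "000":
        return BPV2_TO_PATTERN[bpv2]
    return PROTECTED_PATTERN[codeword]


print(decode_mode("000", "00", "10"))
print(decode_mode("001", "11"))
```

In the all-voiced case the bandpass voicing bits carry real data, so no consistency check is available from these fields.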









Table 10. Parameter decoding schemes if a mode error is detected

U/V pattern   Corrected     LSF's               Gain         Pitch        Bandpass voicing     Fourier Magnitude
              U/V pattern
UUU, UUV,     UUU           Repeat LSF's of     Decode and                Set to 0             Set all magnitudes
UVU, VUU                    the last frame in   apply                                          to 1
                            the previous        smoothing
                            superframe
VVU, VUV,     VVV           Decode and apply    Decode and   Decode and   Set the first band   Set all magnitudes
UVV                         smoothing           apply        apply        to 1, others to 0    to 1
                                                smoothing    smoothing
CLAIMS

What is claimed is: 

1. A vocoder apparatus, comprising:

(a) a superframe buffer for receiving multiple frames of voice data;

(b) a frame-based voice encoder analysis module for extracting parametric
voice data from each frame within the superframe buffer;

(c) a superframe encoder for receiving parametric voice data for a series of
frames within the superframe buffer from the analysis module, wherein parametric
voice data received from the analysis module is selectively quantized to produce voice
data which is encoded into an outgoing digital bit stream for transmission;

(d) a superframe decoder for receiving and decoding a digital bit stream
encoded with superframe voice data into quantized frame-based parameters; and

(e) a frame-based decoder synthesizer for receiving the quantized
parameters for each frame and decoding the quantized parameters into a synthesized
voice output.

2. A voice compression apparatus, comprising:

(a) a superframe buffer for receiving multiple frames of voice data;

(b) a frame-based encoder analysis module for analyzing characteristics of
voice data within frames contained in the superframe to produce an associated set of
voice data parameters; and

(c) a superframe encoder for receiving voice data parameters from the
analysis module for a group of frames contained within the superframe buffer, for
reducing by analysis data for the group of frames and for quantizing and encoding said
data into an outgoing digital bit stream for transmission.

3. A voice compression apparatus as recited in claim 2, wherein the
analysis module is capable of receiving voice data parameters is selected from the group
of voice encoders consisting of linear predictive coders, mixed-excitation linear
prediction coders, harmonic coders, and multiband excitation coders.

4. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes at least two parametric processing modules selected from
the group of parametric processing modules consisting of pitch smoothers, bandpass
voicing smoothers, linear predictive quantizers, jitter quantizers, and Fourier magnitude
quantizers.

5. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes a vector quantizer wherein pitch values within a
superframe are vector quantized with a distortion measure responsive to pitch errors.

6. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes a vector quantizer wherein pitch values within a
superframe are vector quantized with a distortion measure responsive to pitch
differentials as well as pitch errors.

7. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes a quantizer of linear prediction parameters, wherein
quantization is performed with a codebook-based interpolation of linear prediction
parameters that employ different interpolation coefficients for each linear prediction
parameter, and wherein said quantizer operates in closed loop mode to minimize overall
error over a number of frames.

8. A voice compression apparatus as recited in claim 7, wherein said 
quantizer is capable of performing a line spectral frequency (LSF) quantization using 
said codebook-based interpolation. 

9. A voice compression apparatus as recited in claim 8, wherein said
codebook is created by means of a training database operated on by a centroid-based
training procedure.

10. A voice compression apparatus as recited in claim 2, wherein said 
superframe encoder includes a pitch smoother wherein calculations are based on an 
onset/offset classifier. 

11. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes a pitch smoother wherein pitch trajectory is calculated
using a plurality of voicing decisions.

12. A voice compression apparatus as recited in claim 11, wherein said pitch
smoother classifies frames into onset and offset frames based on at least four waveform
feature parameters selected from the group of waveform feature parameters consisting
of energy, zero-crossing rate, peakiness, maximum correlation coefficient of input
speech, maximum correlation coefficient of 500 Hz low pass filtered speech, energy of
low pass filtered speech, and energy of high pass filtered speech.

13. A voice compression apparatus as recited in claim 2, wherein said 
superframe encoder includes a bandpass voicing smoother for mapping multiband 
voicing decisions for each frame into a single cutoff frequency for that frame, wherein 
said cutoff frequency takes on one value from a predetermined list of allowable values. 

14. A voice compression apparatus as recited in claim 13, wherein said 
bandpass voicing smoother performs smoothing by modifying the cutoff frequency of a 
frame as a function of the cutoff frequencies of neighboring frames and the average 
frame energy.

15. A voice compression apparatus as recited in claim 2, further comprising
means for compressing aperiodic flag bits for each frame in a superframe into a single
bit per superframe, which bit is created based on the distribution of voiced and unvoiced
frames within the superframe.

16. A voice compression apparatus as recited in claim 2, wherein said
superframe encoder includes a plurality of quantizers for encoding parametric data into
a set of bits, wherein at least one of said quantizers employs vector quantization to
represent interpolation coefficients.

17. A voice compression apparatus as recited in claim 2, wherein a
superframe is categorized into one of a plurality of coding states based on the
combination of voiced and unvoiced frames within the superframe, and wherein each of
said coding states is associated with a different bit allocation to be used with the
superframe.

18. A voice compression apparatus, comprising:

(a) a superframe buffer for receiving multiple frames of voice data;

(b) a frame-based analysis module for determining a set of voice data
parameters for said voice data; and

(c) a superframe encoder for receiving unquantized voice data parameters
for groups of frames within a superframe, said superframe encoder comprising

(i) a pitch smoother for determining pitch and U/V decisions for
each frame of the superframe and extracts parameters needed for frame
classification into onset and offset frames,

(ii) a bandpass voicing smoother for determining bandpass voicing
strengths for the frames within the superframe and determines cutoff frequencies
for each frame, and

(iii) a parameter quantizer and encoder for quantizing and encoding
voicing parameters received from said analysis module, said pitch smoother, and
said bandpass voicing smoother into a set of bits and encoding said bits into an
outgoing digital bit stream for transmission.

19. A voice decoder apparatus, comprising:

(a) a superframe decoder for receiving an incoming digital bit stream as a
series of superframes and decoding and inverse quantizing said superframes into
quantized frame-based voice parameters; and

(b) a frame-based decoder for receiving said quantized frame-based voice
parameters and combining said quantized frame-based voice parameters into a
synthesized voice output signal.

20. A method of decoding a parametric voice encoded data stream into an
audio voice signal comprising the steps of:

(a) buffering a received parametric voice data stream having a plurality of
pitch periods and loading said buffered frame data into a buffer;

(b) constructing an estimated spectrum of excitation within each pitch period
by breaking down the frequency spectrum into regions based on cutoff frequency,
wherein said construction comprises the steps of:

(i) computing Fourier magnitude for each region, wherein the
resultant computed Fourier magnitudes for at least one of said regions is then
scaled by a gain factor computed for that region,

(ii) computing phase within each region, wherein the resultant phase
for at least one of said regions has been modified by use of a weighted random
phase, and

(iii) converting said Fourier magnitude and said phase within each
region to a time domain representation by the computation of an inverse discrete
Fourier transform; and

(c) generating an analog voice signal from said time domain representation.

21. A method as recited in claim 20, wherein said regions through which the
frequency spectrum is broken down into comprise:

(a) a lower region wherein Fourier magnitudes directly determine the
spectrum;

(b) a transition region wherein Fourier magnitudes are scaled down by a
linearly decreasing weighting factor that drops from unity to a nonzero positive value
dependent on the cutoff frequency of the current frame; and

(c) an upper region wherein Fourier magnitudes are scaled down by a
weighting factor dependent on the cutoff frequency of the current frame.

22. An up-transcoder apparatus which receives a superframe encoded voice
data stream and converts it to a frame-based encoded voice data stream, comprising:

(a) a superframe buffer for collecting superframe data and extracting bits
representing superframe parameters;

(b) a decoder for inverse quantizing the bits for each set of superframe
parameters into a set of quantized parameter values for each frame of the superframe;
and

(c) a frame-based encoder for quantizing the voice parameters for each of
the underlying frames, mapping said quantized voice parameters into frame-based data,
and producing a frame-based voiced data stream.

23. A down-transcoder apparatus which receives an encoded frame-based
voice data stream and converts it into a superframe-based encoded voice data stream,
comprising:

(a) a superframe buffer for collecting a number of frames of parametric
voice data and extracting bits representing frame-based voice parameters;

(b) a decoder for inverse quantizing the bits for each frame of parameter into
quantized parameter values for each frame; and

(c) a superframe encoder for collecting said quantized frame-based
parameters for the group of frames within the superframe, producing a set of parametric
voice data, and quantizing and encoding said parametric voice data into an outgoing
digital bit stream.

24. A vocoder method for encoding digitized voice into parametric voice
data, comprising the steps of:

(a) loading multiple frames of digitized voice into a superframe buffer;

(b) encoding digitized voice within each frame of the superframe buffer by
parametric analysis to produce frame-based parametric voice data;

(c) classifying frames as onset frames and offset frames by calculating pitch
and U/V parameters within each frame of the superframe;

(d) determining a cutoff frequency for each frame within the superframe by
calculating a bandpass voicing strength parameter for the frames within the superframe
buffer;

(e) collecting a set of superframe parameters from the parametric analysis,
frame classification, and cutoff frequency determination steps for the group of frames
within the superframe;

(f) quantizing the superframe parameters into discrete values represented by
a reduced set of data bits that form quantized superframe parameter data; and

(g) encoding quantized superframe parameter data into a data stream of
superframe-based parametric voice data that contains substantially equivalent voice
information to the frame-based parametric voice data, yet at a lower bit per second rate
of encoded voice.

25. A vocoder method for producing digitized voice from superframe-based
parametric voice data, comprising the steps of:

(a) receiving superframe-based parametric voice data in a superframe buffer;

(b) decoding and inverse quantizing the voice data within the superframe
buffer to recreate a set of frame-based voice parameter values; and

(c) decoding the frame-based voice parameters with a frame-based voice
synthesizer which decodes the frame-based voice parameters to produce a digitized
voice output.